# ABOUT
- this notebook:
    - explores the Shopee address elements extraction dataset
- insights:
    - given an address, e.g. "setu siung 119 rt 5 1 13880 cipayung", we need to predict
        1. point of interest (POI), which is empty for this address
        2. street name, e.g. siung
        
    - the solutions can be:
        1. a span/substring of the address
        2. empty, i.e. the element does not exist
        3. a completed version of an abbreviated span, e.g. xl axi pt tbk -> xl axiata pt tbk
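
The three solution types above can be told apart with a simple check. The following is a minimal sketch; the function name `label_kind` and its return values are hypothetical and chosen only for illustration.

```python
def label_kind(raw_address: str, label: str) -> str:
    """Classify a label as 'empty', 'span', or 'completed' (hypothetical names)."""
    if label == "":
        return "empty"      # the element does not exist
    if label in raw_address:
        return "span"       # the label appears verbatim in the address
    return "completed"      # the label is not a verbatim substring

print(label_kind("aye, jati sampurna", ""))
print(label_kind("setu siung 119 rt 5 1 13880 cipayung", "siung"))
print(label_kind("xl axi pt tbk sul agung, no 9", "xl axiata pt tbk"))
```

The empty check comes first because an empty string is a substring of every string in Python.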


## Large dataset
- the dataset is quite large, with 300,000 rows

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv(r"C:\Users\tanch\Documents\Coding Competitions\Shopee\Comp 2 address elements extraction\datasets\train.csv\train.csv")
df = df.set_index('id')
df.head()

id,raw_address,POI/street
0,jl kapuk timur delta sili iii lippo cika 11 a ...,/jl kapuk timur delta sili iii lippo cika
1,"aye, jati sampurna",/
2,setu siung 119 rt 5 1 13880 cipayung,/siung
3,"toko dita, kertosono",toko dita/
4,jl. orde baru,/jl. orde baru


In [3]:
# split 'POI/street' column into 2 individual columns
df[['POI', 'street']] = df['POI/street'].str.split('/', expand=True)
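
To make the label format concrete, the same split can be sketched on single strings in plain Python. The helper `split_label` is hypothetical; it uses `str.partition`, which splits on the first `/` only, whereas the pandas call above splits on every `/`.

```python
def split_label(label: str) -> tuple:
    """Split a 'POI/street' label into (POI, street) at the first '/'."""
    poi, _, street = label.partition("/")
    return poi, street

# Labels copied from the head of the dataframe above.
print(split_label("/siung"))      # empty POI, street only
print(split_label("toko dita/"))  # POI only, empty street
print(split_label("/"))           # both empty
```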

In [4]:
print("Number of rows: ", len(df))

Number of rows:  300000


## Empty predictions
- quite a number of raw_addresses do not have a POI or a street

In [5]:
print("Number of empty POI: ",sum(df['POI']==''))
print("Number of NON empty POI: ",sum(df['POI']!=''))

Number of empty POI:  178509
Number of NON empty POI:  121491


In [6]:
print("Number of empty street: ",sum(df['street']==''))
print("Number of NON empty street: ",sum(df['street']!=''))

Number of empty street:  70143
Number of NON empty street:  229857
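
As a quick sanity check, the counts printed above can be turned into proportions. This sketch only redoes the arithmetic on the reported numbers; it does not read the dataset.

```python
total = 300000         # number of rows reported above
empty_poi = 178509     # rows with an empty POI
empty_street = 70143   # rows with an empty street

print(f"empty POI share:    {empty_poi / total:.1%}")
print(f"empty street share: {empty_street / total:.1%}")
```

Roughly three in five rows have no POI, while fewer than one in four have no street.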


In [7]:
# no POI exists in these examples
df[df['POI']==''].sample(3)

id,raw_address,POI/street,POI,street
2759,pondok kacang timur jl prima 98a rt 2 10 15226,/jl prima,,jl prima
149506,"geb sari, 79 bambu apus rt 2 rw 5 cipayung",/geb sari,,geb sari
60862,baba sari 1 169,/baba sari 1,,baba sari 1


## Normal rows
- the rest are normal rows where the answer is a substring of the raw_address
    - e.g. pro evo, gatot subr tanjung pinang timur -> pro evo
- note: an empty answer also counts as a substring, so the counts in this section include the empty rows

In [15]:
poi_is_subtring = df.apply(lambda row: row['POI'] in row['raw_address'], axis=1)
street_is_subtring = df.apply(lambda row: row['street'] in row['raw_address'], axis=1)

In [16]:
print("Number of normal POI rows: ", sum(poi_is_subtring))
print("Number of normal street rows: ", sum(street_is_subtring))

Number of normal POI rows:  253860
Number of normal street rows:  282613
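
Because `'' in s` is `True` for any Python string, the substring counts above include every empty label. The sketch below illustrates this on three rows taken from this notebook and then subtracts the empty counts reported earlier to estimate the number of non-empty spans.

```python
# Toy rows (raw_address, POI) copied from samples in this notebook.
rows = [
    ("aye, jati sampurna", ""),
    ("toko dita, kertosono", "toko dita"),
    ("xl axi pt tbk sul agung, no 9", "xl axiata pt tbk"),
]

# Plain substring test: the empty label passes.
is_substring = [poi in raw for raw, poi in rows]
# Stricter test: the label must be non-empty and a substring.
is_nonempty_span = [poi != "" and poi in raw for raw, poi in rows]
print(is_substring)
print(is_nonempty_span)

# Substring counts minus empty counts, using the numbers printed above.
nonempty_poi_spans = 253860 - 178509
nonempty_street_spans = 282613 - 70143
print(nonempty_poi_spans, nonempty_street_spans)
```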


- the following shows normal rows

In [17]:
# combine both masks in a single .loc call instead of chaining boolean indexers
df.loc[(df['POI']!='') & poi_is_subtring, ["raw_address","POI"]].sample(5)

id,raw_address,POI
181276,"toko fagansya, suta, gebang",toko fagansya
70115,perum griya mukti ciwareng blok b11,perum griya mukti
174241,"warung bakso cak mundik,",warung bakso cak mundik
20239,"the green sukun , bakalankrajan, sukun, malang...",the green
264482,"mts al-gazaliyah, cemp besar",mts al-gazaliyah


## Abbreviated rows
- in some rows the raw_address is abbreviated and the answer is the completed form
    - e.g. ud. hus al-b, pasar gede bage, mekar mulya -> ud. husein al-bana

In [18]:
print("Number of abbreviated POI rows: ", len(df) - sum(poi_is_subtring))
print("Number of abbreviated street rows: ", len(df) - sum(street_is_subtring))

Number of abbreviated POI rows:  46140
Number of abbreviated street rows:  17387
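
In the samples shown in this notebook, the abbreviated words look like prefixes of the full words. A minimal sketch of that check follows; it assumes the abbreviated span has already been located in the raw_address and that words are only truncated from the right, which has not been verified across the whole dataset. The name `is_prefix_abbreviation` is hypothetical.

```python
def is_prefix_abbreviation(raw_span: str, label: str) -> bool:
    """True if every word of raw_span is a prefix of the matching word in label."""
    raw_tokens = raw_span.split()
    label_tokens = label.split()
    if len(raw_tokens) != len(label_tokens):
        return False
    return all(full.startswith(short)
               for short, full in zip(raw_tokens, label_tokens))

print(is_prefix_abbreviation("xl axi pt tbk", "xl axiata pt tbk"))
print(is_prefix_abbreviation("ud. hus al-b", "ud. husein al-bana"))
print(is_prefix_abbreviation("toko dita", "toko furniture"))
```

A full (unabbreviated) span also passes this check, since every word is a prefix of itself.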


- the following shows abbreviated rows

In [19]:
# combine both masks in a single .loc call instead of chaining boolean indexers
df.loc[(df['POI']!='') & ~poi_is_subtring, ["raw_address","POI"]].sample(10)

id,raw_address,POI
159945,"xl axi pt tbk sul agung, no 9",xl axiata pt tbk
196878,toko furni kutosari,toko furniture
126639,"balungbangjaya taman dram indah batu hul, 1611...",taman dramaga indah
183179,"gobers hos, petite kerobokan kelod kuta utara",gobers hostel
124112,"kediaman mar janggr, tunt brin, tuntang",kediaman marpuk janggrengan
112974,perumahan tange resid tanah merah sepatan timur,perumahan tangerang residence
241584,"puskesmas mran 1, raya mran,",puskesmas mranggen 1
20572,komplek bukit padjad pasir impun cikadut cimeu...,komplek bukit padjadjaran
98893,"kan kecam inuman, banjar nantigo inuman",kantor kecamatan inuman
163684,"angkri pak, dr raji,",angkringan pakjek


In [9]:
# POIs containing "yaya", e.g. "yayasan", which the raw_address often shortens to "yaya"
df[df['POI'].apply(lambda POI: "yaya" in POI)]

id,raw_address,POI/street,POI,street
11,"yaya atohar,",yayasan atohariyah/,yayasan atohariyah,
641,"yayasan gugah nur indon, kel gad, kelapa gadin...",yayasan gugah nurani indonesia/kelapa gad,yayasan gugah nurani indonesia,kelapa gad
998,"yaya pelayanan halieluyah, tebet raya, 30d rw ...",yayasan pelayanan halieluyah/tebet raya,yayasan pelayanan halieluyah,tebet raya
2333,"yayasan al jam, r e martad,",yayasan al jamaah/r e martad,yayasan al jamaah,r e martad
2443,"yayasan bha suci jl. gajah mada ponti, gajah m...",yayasan bhakti suci jl. gajah mada pontianak/g...,yayasan bhakti suci jl. gajah mada pontianak,gajah mada
...,...,...,...,...
295244,"yaya shiddig, kena indah, jati mulyo lowokwaru",yayasan shiddigiyyah/kena indah,yayasan shiddigiyyah,kena indah
296536,"rengasde, no 21 yayasan al ishaqiyah kalangsari",yayasan al ishaqiyah/rengasde,yayasan al ishaqiyah,rengasde
296712,yayasan pengalasan songo???j,yayasan pengalasan songo???jombang/,yayasan pengalasan songo???jombang,
298181,"yayasan seko kall, puri resid, no b 53",yayasan sekolah kallista/puri resid,yayasan sekolah kallista,puri resid


In [12]:
# POIs containing "dan bar"; this also matches "dan baru", "dan barbershop" and "medan baru"
df[df['POI'].apply(lambda POI: "dan bar" in POI)]

id,raw_address,POI/street,POI,street
72263,"h2o pool dan bar, kar mas sejah,",h2o pool dan bar/kar mas sejah,h2o pool dan bar,kar mas sejah
84162,"mase uma kitc dan bar, kerobokan kelod",mase uma kitchen dan bar/,mase uma kitchen dan bar,
91282,"rm. medan baru krekot, spesi bur punai, kre bu...","rm. medan baru krekot, spesialis burung punai/...","rm. medan baru krekot, spesialis burung punai",krekot bun
126246,"rumpi coffee dan barbershop prof. dr. m. yam sh,",rumpi coffee dan barbershop/prof. dr. m. yam sh,rumpi coffee dan barbershop,prof. dr. m. yam sh
134657,kantor kepala desa pengadan baru,kantor kepala desa pengadan baru/,kantor kepala desa pengadan baru,
145150,"nib raya, no 153 pol medan baru,",polsek medan baru/nib raya,polsek medan baru,nib raya
201883,cipete utara beli mesin poles bekas dan baru h...,beli mesin poles bekas dan baru/haji saidi 1,beli mesin poles bekas dan baru,haji saidi 1
250129,"ritual kitchen dan bar, scie boule barat medang",ritual kitchen dan bar/scie boulevard barat,ritual kitchen dan bar,scie boulevard barat
