<a href="https://colab.research.google.com/github/atlas-github/nih_time_series_nlp/blob/main/nih_time_series_nlp_day3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 09:00 am: Practical session 6

### Split Data into Training and Testing Sets for Time Series
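Time series data has an inherent order, so split it chronologically: earlier observations for training, later ones for testing. Avoid a shuffled split (the default behaviour of `train_test_split`), because it would let the model train on observations that come after the ones it is tested on.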

In [None]:
import pandas as pd

# Sample time series data
data = pd.date_range('2020-01-01', periods=100, freq='D')
values = range(100)
df = pd.DataFrame({'date': data, 'value': values})

# Split the data by time index
train_size = int(len(df) * 0.8)  # 80% for training, 20% for testing
train, test = df[:train_size], df[train_size:]

print(f"Train set:\n{train.head()}\n")
print(f"Test set:\n{test.head()}\n")

Train set:
        date  value
0 2020-01-01      0
1 2020-01-02      1
2 2020-01-03      2
3 2020-01-04      3
4 2020-01-05      4

Test set:
         date  value
80 2020-03-21     80
81 2020-03-22     81
82 2020-03-23     82
83 2020-03-24     83
84 2020-03-25     84
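
When the boundary between training and testing is a calendar date rather than a fraction of the rows, the same idea can be expressed with a cutoff. The following is a minimal sketch in plain Python so the logic is visible; `split_by_cutoff` and the `(date, value)` record layout are illustrative names, not part of the notebook's original code.

```python
from datetime import date, timedelta

def split_by_cutoff(records, cutoff):
    """Split (date, value) records into train (< cutoff) and test (>= cutoff)."""
    ordered = sorted(records, key=lambda r: r[0])  # enforce chronological order
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test

# Same toy series as above: 100 daily observations starting 2020-01-01
records = [(date(2020, 1, 1) + timedelta(days=i), i) for i in range(100)]
train_records, test_records = split_by_cutoff(records, cutoff=date(2020, 3, 21))

print(len(train_records), len(test_records))
print(train_records[-1][0], test_records[0][0])
```

Sorting before splitting guards against input that is not already in time order, which is the usual way future observations end up in a training set by accident.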



### Time Series Cross-Validation Techniques
The `TimeSeriesSplit` class in scikit-learn can be used for time series cross-validation. It splits the data sequentially, preserving the order of time: each fold trains on all rows before its test block, so the training set grows from fold to fold.

In [None]:
from sklearn.model_selection import TimeSeriesSplit

# Create a splitter with 3 folds (reuses df from the previous cell)
tscv = TimeSeriesSplit(n_splits=3)

for train_index, test_index in tscv.split(df):
    train, test = df.iloc[train_index], df.iloc[test_index]
    print(f"Train indices: {train_index}, Test indices: {test_index}\n")
    print(f"Train set:\n{train.head()}\n")
    print(f"Test set:\n{test.head()}\n")

Train indices: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24], Test indices: [25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49]

Train set:
        date  value
0 2020-01-01      0
1 2020-01-02      1
2 2020-01-03      2
3 2020-01-04      3
4 2020-01-05      4

Test set:
         date  value
25 2020-01-26     25
26 2020-01-27     26
27 2020-01-28     27
28 2020-01-29     28
29 2020-01-30     29

Train indices: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49], Test indices: [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
 74]

Train set:
        date  value
0 2020-01-01      0
1 2020-01-02      1
2 2020-01-03      2
3 2020-01-04      3
4 2020-01-05      4

Test set:
         date  value
50 2020-02-20     50
51 2020-02-21     51
52 2020-02-22     52
53 2020-02-23     53
54 2020-02-24     54

Train indice
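
The fold boundaries printed above can be reproduced by hand. The sketch below is a plain-Python approximation of the default splitting rule as it appears from the output (equal-sized test blocks taken from the end of the series, with no gap). `expanding_splits` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
def expanding_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with equally sized, consecutive test folds."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_indices = list(range(0, test_start))  # everything before the test block
        test_indices = list(range(test_start, test_start + test_size))
        yield train_indices, test_indices

for fold, (train_idx, test_idx) in enumerate(expanding_splits(100, 3), start=1):
    print(f"Fold {fold}: train 0..{train_idx[-1]} ({len(train_idx)} rows), "
          f"test {test_idx[0]}..{test_idx[-1]} ({len(test_idx)} rows)")
```

With 100 rows and 3 splits this gives test blocks of 25 rows starting at positions 25, 50 and 75, which matches the first two folds shown in the output.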

### Rolling Window Cross-Validation
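This technique evaluates a forecast repeatedly as the split point moves forward through the series, which suits forecasting tasks. The implementation below uses an expanding window: at each step the training set is every row before position `i`, and the test set is the single row at position `i`. `window_size` therefore sets the initial training size rather than a fixed window length.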

In [None]:
def rolling_window_cross_validation(df, window_size):
    for i in range(window_size, len(df)):
        train = df[:i]
        test = df[i:i+1]
        print(f"Train size: {len(train)}, Test size: {len(test)}")
        print(f"Train set:\n{train.tail(3)}")
        print(f"Test set:\n{test}\n")

# Example: start with 80 training rows; each step adds one row and tests on the next
rolling_window_cross_validation(df, window_size=80)

Train size: 80, Test size: 1
Train set:
         date  value
77 2020-03-18     77
78 2020-03-19     78
79 2020-03-20     79
Test set:
         date  value
80 2020-03-21     80

Train size: 81, Test size: 1
Train set:
         date  value
78 2020-03-19     78
79 2020-03-20     79
80 2020-03-21     80
Test set:
         date  value
81 2020-03-22     81

Train size: 82, Test size: 1
Train set:
         date  value
79 2020-03-20     79
80 2020-03-21     80
81 2020-03-22     81
Test set:
         date  value
82 2020-03-23     82

Train size: 83, Test size: 1
Train set:
         date  value
80 2020-03-21     80
81 2020-03-22     81
82 2020-03-23     82
Test set:
         date  value
83 2020-03-24     83

Train size: 84, Test size: 1
Train set:
         date  value
81 2020-03-22     81
82 2020-03-23     82
83 2020-03-24     83
Test set:
         date  value
84 2020-03-25     84

Train size: 85, Test size: 1
Train set:
         date  value
82 2020-03-23     82
83 2020-03-24     83
84 2020-03-2
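
In the output above the training set keeps growing (80, 81, 82, ... rows). If a fixed-length window is wanted instead, so that the oldest row is dropped each time a new one is added, the loop can be sketched as follows. `rolling_window_splits` and its `horizon` parameter are illustrative assumptions, and the function works on row positions only so that it stays independent of pandas.

```python
def rolling_window_splits(n_samples, window_size, horizon=1):
    """Yield (train_range, test_range) pairs with a fixed-length training window."""
    for end in range(window_size, n_samples - horizon + 1):
        train_range = range(end - window_size, end)  # always window_size rows
        test_range = range(end, end + horizon)       # the next `horizon` rows
        yield train_range, test_range

splits = list(rolling_window_splits(n_samples=100, window_size=80))
first_train, first_test = splits[0]
last_train, last_test = splits[-1]

print(len(splits))
print(first_train.start, first_train.stop - 1, list(first_test))
print(last_train.start, last_train.stop - 1, list(last_test))
```

Each range can be passed to `df.iloc[...]` to obtain the matching rows. A fixed window is often preferred when older observations are thought to be less representative of current behaviour, while an expanding window uses all available history.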

## 09:45 am: Introduction to NLP data preprocessing

### Tokenization
Tokenize a sentence into words.

In [None]:
from nltk.tokenize import word_tokenize
import nltk
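nltk.download('punkt')
# Newer NLTK releases may also need the 'punkt_tab' resource for word_tokenize:
# nltk.download('punkt_tab')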
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
sentence = "Natural Language Processing is fascinating."
tokens = word_tokenize(sentence)
print(tokens)

['Natural', 'Language', 'Processing', 'is', 'fascinating', '.']
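
To see roughly what a word tokenizer does, the same sentence can be split with a single regular expression. This is a deliberately simplified sketch, not NLTK's algorithm; `simple_tokenize` is a hypothetical helper.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Natural Language Processing is fascinating."))
```

For this sentence the result matches the `word_tokenize` output above. The two diverge on harder input such as contractions: the regular expression breaks "don't" into three pieces around the apostrophe, whereas `word_tokenize` keeps "n't" together as one token.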


### Stopword removal

Remove common stopwords from a list of tokens.

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Ensure you have downloaded stopwords: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

sentence = "Natural Language Processing is fascinating."
tokens = word_tokenize(sentence)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

['Natural', 'Language', 'Processing', 'fascinating', '.']


### Stemming
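Apply stemming to strip suffixes and reduce words to a common stem. The stem is not always a dictionary word (for example, "easily" becomes "easili").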

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "jumps", "easily", "fairly"]
stems = [stemmer.stem(word) for word in words]
print(stems)

['run', 'jump', 'easili', 'fairli']


### Lemmatization

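Apply lemmatization to map words to their dictionary form (lemma). `WordNetLemmatizer` treats every word as a noun unless a part of speech is passed via the `pos` argument, which is why "running" is returned unchanged below.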

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Ensure you have downloaded wordnet: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words = ["running", "jumps", "easily", "fairly"]
lemmas = [lemmatizer.lemmatize(word) for word in words]
print(lemmas)

['running', 'jump', 'easily', 'fairly']


### Text Normalization
Convert text to lowercase and remove punctuation.

In [None]:
import string

def normalize_text(text):
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    return text

text = "Natural Language Processing is fascinating!"
normalized_text = normalize_text(text)
print(normalized_text)

natural language processing is fascinating


## 11:00 am: Text cleaning techniques

### Removing Special Characters and Numbers
Clean a text by removing special characters and numbers, keeping only letters and spaces.

In [None]:
import re

def clean_text(text):
    # Remove special characters and numbers
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)
    return cleaned_text

text = "Hello, World! 2024 #Python @NLP"
cleaned_text = clean_text(text)
print(cleaned_text)

Hello World  Python NLP
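
Note the double space left behind where "2024" was removed. A minimal sketch of one way to tidy this up is to collapse whitespace after the character filter; `clean_and_collapse` is an illustrative name, not a function from the original notebook.

```python
import re

def clean_and_collapse(text):
    """Keep only letters and whitespace, then collapse runs of whitespace."""
    letters_only = re.sub(r"[^a-zA-Z\s]", "", text)
    return re.sub(r"\s+", " ", letters_only).strip()

print(clean_and_collapse("Hello, World! 2024 #Python @NLP"))
```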


### Handling Case Sensitivity
Convert all text to lowercase to handle case sensitivity.

In [None]:
def to_lowercase(text):
    return text.lower()

text = "Natural Language Processing Is Fascinating!"
lowercase_text = to_lowercase(text)
print(lowercase_text)

natural language processing is fascinating!


### Removing Stopwords
Remove common stopwords from a text.

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Ensure you have downloaded stopwords: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

text = "Natural Language Processing is really fascinating and interesting."
filtered_text = remove_stopwords(text)
print(filtered_text)

Natural Language Processing really fascinating interesting .


### Removing Punctuation
Remove punctuation from a text.

In [None]:
import string

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

text = "Hello, World! This is Python NLP."
cleaned_text = remove_punctuation(text)
print(cleaned_text)

Hello World This is Python NLP


### Combine techniques

In [None]:
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Ensure you have downloaded stopwords: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove punctuation (redundant here: the regex above has already removed it)
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove stopwords
    words = word_tokenize(text)
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)

text = "Hello, World! This is Python NLP, and it is amazing in 2024."
preprocessed_text = preprocess_text(text)
print(preprocessed_text)

hello world python nlp amazing


## 12:00 pm: Practical session 7

In [None]:
#install the gdown library
!pip install gdown



In [None]:
#download dengue csv
import pandas as pd
import gdown

# Google Drive file ID of the dengue CSV
file_id = '1F-faNnQoyhdjbuyEVZPHV1cv_h0h5UDl'
url = f'https://drive.google.com/uc?id={file_id}'

# Download the CSV file
gdown.download(url, 'dengue.csv', quiet=False)

# Read the CSV file into a DataFrame
df_dengue = pd.read_csv('dengue.csv')
df_dengue

Downloading...
From (original): https://drive.google.com/uc?id=1F-faNnQoyhdjbuyEVZPHV1cv_h0h5UDl
From (redirected): https://drive.google.com/uc?id=1F-faNnQoyhdjbuyEVZPHV1cv_h0h5UDl&confirm=t&uuid=2a847309-c9be-4477-970c-436186d42bf3
To: /content/dengue.csv
100%|██████████| 168M/168M [00:04<00:00, 41.5MB/s]


Unnamed: 0,OBJECTID,NO_KES,TRK_NOTI,TRK_DAFTAR,STATUS_XY,RUNSISDATE,OBJECTID_1,NO_NOTI,NO_KES_1,TRK_ONSET,...,TRK_RAWATA,TRK_DIAGNO,TPT_RAWATA,TRK_PTP,TRK_SRT,TRK_ULV,STATUS_PEN,STATUS_LOK,UJIAN_RAPI,RUNSISDA_1
0,1,2016/246923,2014/12/22,2014/12/28,Dalam Sempadan,2022/09/09,192,2221161,2016/246923,2014/12/19,...,,2014/12/22,KLINIK KERAJAAN,,,,Pihak Berkuasa Tempatan,Bandar,TIADA,2022/08/19
1,2,2016/247869,2014/12/20,2014/12/28,Dalam Sempadan,2022/09/09,129,2218637,2016/247869,2014/12/16,...,2014/12/20,2014/12/20,WAD - HOSPITAL KERAJAAN,2014/12/24,2014/12/23,2014/12/23,KKM,Luar Bandar,TIADA,2022/08/19
2,3,2016/247870,2014/12/22,2014/12/28,Dalam Sempadan,2022/09/09,113,2219970,2016/247870,2014/12/20,...,2014/12/21,2014/12/22,WAD - HOSPITAL KERAJAAN,2014/12/24,2014/12/23,2014/12/23,KKM,Luar Bandar,TIADA,2022/08/19
3,6,2016/248203,2014/12/25,2014/12/28,Dalam Sempadan,2022/09/09,161,2224608,2016/248203,2014/12/15,...,2014/12/15,2014/12/25,WAD - HOSPITAL KERAJAAN,,,,Pihak Berkuasa Tempatan,Bandar,TIADA,2022/08/19
4,7,2016/248205,2014/12/25,2014/12/28,Dalam Sempadan,2022/09/09,18,2224730,2016/248205,2014/12/19,...,2014/12/23,2014/12/23,JABATAN KECEMASAN & TRAUMA (A&E) - HOSPITAL KE...,,,,Pihak Berkuasa Tempatan,Bandar,TIADA,2022/08/19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
294003,527065,2022/3556,2022/02/03,2022/02/04,Dalam Sempadan,2022/09/09,546394,7700384,2022/3556,2022/01/30,...,,2022/02/03,KLINIK KERAJAAN,2022/02/07,,,Pihak Berkuasa Tempatan,Bandar,"NS1 Positif, IgG Negatif, IgM Negatif",2022/08/29
294004,527066,2022/3557,2022/02/02,2022/02/04,Dalam Sempadan,2022/09/09,546432,7699146,2022/3557,2022/01/31,...,2022/02/02,2022/02/02,JABATAN KECEMASAN & TRAUMA (A&E) - HOSPITAL SW...,2022/02/07,,,Pihak Berkuasa Tempatan,Bandar,IgG Positif,2022/08/29
294005,527067,2022/3560,2022/02/03,2022/02/04,Dalam Sempadan,2022/09/09,546406,7693895,2022/3560,2022/02/02,...,2022/02/02,2022/02/03,JABATAN KECEMASAN & TRAUMA (A&E) - HOSPITAL SW...,,,,Pihak Berkuasa Tempatan,Bandar,NS1 Positif,2022/08/29
294006,527068,2022/3561,2022/02/03,2022/02/04,Dalam Sempadan,2022/09/09,546437,7698916,2022/3561,2022/01/31,...,,2022/02/03,KLINIK KERAJAAN,2022/02/05,,,Pihak Berkuasa Tempatan,Bandar,IgG Positif,2022/08/29


In [None]:
#filter to only needed columns
df_dengue_filtered = df_dengue[["NO_KES", "NO_RUMAH", "POSKOD", "LOKALITI", "MUKIM", "DAERAH", "LATITUDE", "LONGITUDE", "STATUS_LOK"]]

df_dengue_filtered

Unnamed: 0,NO_KES,NO_RUMAH,POSKOD,LOKALITI,MUKIM,DAERAH,LATITUDE,LONGITUDE,STATUS_LOK
0,2016/246923,"D-302 KIP,",,BL PJU 6 BU 11 : KUATERS GURU,DAMANSARA,PETALING,3.13143,101.608,Bandar
1,2016/247869,57,45300.0,KG.SG.BURUNG,PANCHANG BEDENA,SABAK BERNAM,3.68780,100.966,Luar Bandar
2,2016/247870,LOT 5887,45300.0,"JLN MASJID, PARIT 1 BARAT",PANCHANG BEDENA,SABAK BERNAM,3.68217,100.988,Luar Bandar
3,2016/248203,NO.24,40000.0,SEKSYEN 7 (TERES D : JLN 7/23 - 7/27),BUKIT RAJA,PETALING,3.07742,101.494,Bandar
4,2016/248205,SAZEAN DEVELOPEMENT SDN BHD NO 48,40200.0,SEKSYEN 22 ( KAW PERINDUSTRIAN A : JLN 22/1 - ...,BUKIT RAJA,PETALING,3.06994,101.556,Bandar
...,...,...,...,...,...,...,...,...,...
294003,2022/3556,"B6-01-01 , APARTMENT PALMA , JALAN DESA RIA ,...",,APPT PALMA BCH,RAWANG,GOMBAK,3.32510,101.536,Bandar
294004,2022/3557,"871 , JALAN E4/2\nTAMAN EHSAN KEPONG\n52100 KU...",52100.0,TMN EHSAN FASA 4 (ZON A),BATU,GOMBAK,3.22250,101.618,Bandar
294005,2022/3560,NO.81 JALAN MELATI 17 TAMAN MELATI\n42600 JENJ...,42600.0,TAMAN MELATI ZON 2,TANJONG 12 (1),KUALA LANGAT,2.87148,101.489,Bandar
294006,2022/3561,"NO.0103 , BLOK SEROJA , TAMAN TUN TEJA .",,APPT SEROJA TTT,RAWANG,GOMBAK,3.29720,101.589,Bandar


In [None]:
# combine NO_RUMAH and LOKALITI into a single address string
# (assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
df_dengue_filtered = df_dengue_filtered.assign(
    NO_RUMAH_LOKALITI=df_dengue_filtered["NO_RUMAH"].astype(str) + " " + df_dengue_filtered["LOKALITI"]
)
df_dengue_filtered


Unnamed: 0,NO_KES,NO_RUMAH,POSKOD,LOKALITI,MUKIM,DAERAH,LATITUDE,LONGITUDE,STATUS_LOK,NO_RUMAH_LOKALITI
0,2016/246923,"D-302 KIP,",,BL PJU 6 BU 11 : KUATERS GURU,DAMANSARA,PETALING,3.13143,101.608,Bandar,"D-302 KIP, BL PJU 6 BU 11 : KUATERS GURU"
1,2016/247869,57,45300.0,KG.SG.BURUNG,PANCHANG BEDENA,SABAK BERNAM,3.68780,100.966,Luar Bandar,57 KG.SG.BURUNG
2,2016/247870,LOT 5887,45300.0,"JLN MASJID, PARIT 1 BARAT",PANCHANG BEDENA,SABAK BERNAM,3.68217,100.988,Luar Bandar,"LOT 5887 JLN MASJID, PARIT 1 BARAT"
3,2016/248203,NO.24,40000.0,SEKSYEN 7 (TERES D : JLN 7/23 - 7/27),BUKIT RAJA,PETALING,3.07742,101.494,Bandar,NO.24 SEKSYEN 7 (TERES D : JLN 7/23 - 7/27)
4,2016/248205,SAZEAN DEVELOPEMENT SDN BHD NO 48,40200.0,SEKSYEN 22 ( KAW PERINDUSTRIAN A : JLN 22/1 - ...,BUKIT RAJA,PETALING,3.06994,101.556,Bandar,SAZEAN DEVELOPEMENT SDN BHD NO 48 SEKSYEN 22 (...
...,...,...,...,...,...,...,...,...,...,...
294003,2022/3556,"B6-01-01 , APARTMENT PALMA , JALAN DESA RIA ,...",,APPT PALMA BCH,RAWANG,GOMBAK,3.32510,101.536,Bandar,"B6-01-01 , APARTMENT PALMA , JALAN DESA RIA ,..."
294004,2022/3557,"871 , JALAN E4/2\nTAMAN EHSAN KEPONG\n52100 KU...",52100.0,TMN EHSAN FASA 4 (ZON A),BATU,GOMBAK,3.22250,101.618,Bandar,"871 , JALAN E4/2\nTAMAN EHSAN KEPONG\n52100 KU..."
294005,2022/3560,NO.81 JALAN MELATI 17 TAMAN MELATI\n42600 JENJ...,42600.0,TAMAN MELATI ZON 2,TANJONG 12 (1),KUALA LANGAT,2.87148,101.489,Bandar,NO.81 JALAN MELATI 17 TAMAN MELATI\n42600 JENJ...
294006,2022/3561,"NO.0103 , BLOK SEROJA , TAMAN TUN TEJA .",,APPT SEROJA TTT,RAWANG,GOMBAK,3.29720,101.589,Bandar,"NO.0103 , BLOK SEROJA , TAMAN TUN TEJA . APPT..."


In [None]:
# count dengue cases per postcode (value_counts skips rows where POSKOD is NaN)
postcode_counts = df_dengue_filtered['POSKOD'].value_counts().reset_index()

# Rename the columns
postcode_counts.columns = ['postcode', 'count']

postcode_counts

Unnamed: 0,postcode,count
0,43000.0,22665
1,68000.0,13900
2,43300.0,11192
3,41200.0,10018
4,43200.0,9536
...,...,...
1098,45950.0,1
1099,57600.0,1
1100,83700.0,1
1101,45320.0,1


In [None]:
#how many NaNs in each column
df_dengue_filtered.isna().sum()

Unnamed: 0,0
NO_KES,0
NO_RUMAH,6
POSKOD,92467
LOKALITI,68
MUKIM,1
DAERAH,0
LATITUDE,0
LONGITUDE,0
STATUS_LOK,27276
NO_RUMAH_LOKALITI,68


In [None]:
# filter to only the rows with a missing postcode
df_missing_postcodes = df_dengue_filtered[df_dengue_filtered["POSKOD"].isna()]
df_missing_postcodes

Unnamed: 0,NO_KES,NO_RUMAH,POSKOD,LOKALITI,MUKIM,DAERAH,LATITUDE,LONGITUDE,STATUS_LOK,NO_RUMAH_LOKALITI
0,2016/246923,"D-302 KIP,",,BL PJU 6 BU 11 : KUATERS GURU,DAMANSARA,PETALING,3.13143,101.608,Bandar,"D-302 KIP, BL PJU 6 BU 11 : KUATERS GURU"
5,2016/248206,LOT 12,,SEKSYEN U19 (KG. PAYA JARAS DALAM),SUNGAI BULOH,PETALING,3.18971,101.544,Bandar,LOT 12 SEKSYEN U19 (KG. PAYA JARAS DALAM)
8,2016/248313,LOT 32840,,SEKSYEN U5 ( TAMAN SEGAR ),BUKIT RAJA,PETALING,3.17433,101.532,Bandar,LOT 32840 SEKSYEN U5 ( TAMAN SEGAR )
21,2016/248607,"NO S2-24A,",,SEKSYEN 7 (PUSAT KOMERSIAL A : JLN AA 7/AA - A...,BUKIT RAJA,PETALING,3.06853,101.491,Bandar,"NO S2-24A, SEKSYEN 7 (PUSAT KOMERSIAL A : JLN ..."
23,2016/248612,NO 47,,SEKSYEN 13 ( TERES B : JLN 13/16 - 13/25 & 13/...,BUKIT RAJA,PETALING,3.08841,101.541,Bandar,NO 47 SEKSYEN 13 ( TERES B : JLN 13/16 - 13/25...
...,...,...,...,...,...,...,...,...,...,...
294000,2022/3553,"NO 8C , JALAN 16 , SELAYANG BARU , 68100 BATU ...",,SELAYANG BARU ZON 15,BATU,GOMBAK,3.24540,101.672,Bandar,"NO 8C , JALAN 16 , SELAYANG BARU , 68100 BATU ..."
294002,2022/3555,(10/20 BLOK TERATAI TAMAN TUN TEJA RAWANG),,APPT TERATAI TTT,RAWANG,GOMBAK,3.29570,101.591,Bandar,(10/20 BLOK TERATAI TAMAN TUN TEJA RAWANG) APP...
294003,2022/3556,"B6-01-01 , APARTMENT PALMA , JALAN DESA RIA ,...",,APPT PALMA BCH,RAWANG,GOMBAK,3.32510,101.536,Bandar,"B6-01-01 , APARTMENT PALMA , JALAN DESA RIA ,..."
294006,2022/3561,"NO.0103 , BLOK SEROJA , TAMAN TUN TEJA .",,APPT SEROJA TTT,RAWANG,GOMBAK,3.29720,101.589,Bandar,"NO.0103 , BLOK SEROJA , TAMAN TUN TEJA . APPT..."
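
Some rows with a missing `POSKOD` still carry a postcode inside the free-text `NO_RUMAH` field (row 294000 above contains "68100"). Since Malaysian postcodes are five digits, a first pass at recovering them could look like the sketch below. `extract_postcode_candidates` is a hypothetical helper, and its output should be treated as candidates to review rather than confirmed postcodes.

```python
import re

POSTCODE_PATTERN = re.compile(r"\b(\d{5})\b")

def extract_postcode_candidates(address):
    """Return every standalone 5-digit number found in a free-text address."""
    if not isinstance(address, str):  # e.g. NaN for a missing address
        return []
    return POSTCODE_PATTERN.findall(address)

print(extract_postcode_candidates("NO 8C , JALAN 16 , SELAYANG BARU , 68100 BATU"))
print(extract_postcode_candidates("LOT 32840"))
print(extract_postcode_candidates("LOT 12"))
```

The second call shows the main weakness: a five-digit lot number such as "LOT 32840" (row 8 above) is indistinguishable from a postcode by pattern alone, so candidates would need to be checked against a list of valid postcodes or against the `DAERAH` column.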


In [None]:
# inspect all cases recorded under postcode 47610
# (the variable is called df_43100, but it holds the 47610 rows)
df_43100 = df_dengue_filtered[df_dengue_filtered["POSKOD"] == 47610]
df_43100

Unnamed: 0,NO_KES,NO_RUMAH,POSKOD,LOKALITI,MUKIM,DAERAH,LATITUDE,LONGITUDE,STATUS_LOK,NO_RUMAH_LOKALITI
18,2016/248602,NO 65,47610.0,USJ 2/2,PETALING,PETALING,3.05908,101.590,Bandar,NO 65 USJ 2/2
389,2016/249670,A6-06-01,47610.0,USJ 1 (ANGSANA APPT),PETALING,PETALING,3.05713,101.783,Bandar,A6-06-01 USJ 1 (ANGSANA APPT)
478,2016/249847,NO 44,47610.0,USJ 4/8 - USJ 4/9,PETALING,PETALING,3.04847,101.575,Bandar,NO 44 USJ 4/8 - USJ 4/9
650,2016/250054,NO 37,47610.0,USJ 6/5 - USJ 6/7,PETALING,PETALING,3.04889,101.588,Bandar,NO 37 USJ 6/5 - USJ 6/7
865,2016/250774,NO 8,47610.0,USJ 6/1,PETALING,PETALING,3.05391,101.590,Bandar,NO 8 USJ 6/1
...,...,...,...,...,...,...,...,...,...,...
293141,2022/2298,63 JALAN USJ 6/2K\nUEP SUBANG JAYA,47610.0,E13. USJ 6/2,7. DAMANSARA (MBSJ),PETALING,3.05216,101.587,Bandar,63 JALAN USJ 6/2K\nUEP SUBANG JAYA E13. USJ 6/2
293239,2022/2441,"27 JLN PINGGIRAN USJ 3/3, TMN PINGGIRAN USJ, S...",47610.0,E13. USJ 3/3,7. DAMANSARA (MBSJ),PETALING,3.04853,101.558,Bandar,"27 JLN PINGGIRAN USJ 3/3, TMN PINGGIRAN USJ, S..."
293436,2022/2720,"17-1,JALAN SIERRA 1/3\nBANDAR 16 SIERRA\nPUCHONG",47610.0,SIERRA 1,DENGKIL,SEPANG,2.97490,101.652,Bandar,"17-1,JALAN SIERRA 1/3\nBANDAR 16 SIERRA\nPUCHO..."
293580,2022/3114,11 JALAN USJ 6/2D\nSUBANG JAYA,47610.0,E13. USJ 6/2,7. DAMANSARA (MBSJ),PETALING,3.05230,101.586,Bandar,11 JALAN USJ 6/2D\nSUBANG JAYA E13. USJ 6/2


In [None]:
!pip install folium



In [None]:
import folium
import pandas as pd

# Initialize the map
m = folium.Map(location=[df_43100['LATITUDE'].mean(), df_43100['LONGITUDE'].mean()], zoom_start=12)

# Add points to the map
for i, row in df_43100.iterrows():
    folium.Marker([row['LATITUDE'], row['LONGITUDE']]).add_to(m)

# Display the map
m

In [None]:
import plotly.graph_objects as go

fig = go.Figure(go.Scattermapbox(
    lat=df_43100['LATITUDE'],
    lon=df_43100['LONGITUDE'],
    mode='markers',
    marker=go.scattermapbox.Marker(size=9),
    text=df_43100['LOKALITI']  # Display the place name when hovering
))

# Define the layout for the map
fig.update_layout(
    mapbox_style="open-street-map",
    mapbox=dict(
        center=dict(lat=df_43100['LATITUDE'].mean(), lon=df_43100['LONGITUDE'].mean()),  # Center the map on the data
        zoom=10
    )
)

# Display the map
fig.show()


In [None]:
#sign up on https://opencagedata.com/api

import requests

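# Replace the placeholder below with your actual OpenCage API key
api_key = '###INSERT OPEN CAGE DATA API KEY HERE###'

# a sample query
query = 'Kuala Lumpur, Malaysia'
url = 'https://api.opencagedata.com/geocode/v1/json'

# Passing params lets requests URL-encode the query (spaces, commas, slashes)
response = requests.get(url, params={'q': query, 'key': api_key}, timeout=10)
response.raise_for_status()
data = response.json()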

# Extract latitude and longitude
latitude = data['results'][0]['geometry']['lat']
longitude = data['results'][0]['geometry']['lng']

print(f'Latitude: {latitude}, Longitude: {longitude}')

Latitude: 3.1516964, Longitude: 101.6942371


In [None]:
# take the first 10 rows to keep the number of API calls small
df_43100_test = df_43100.iloc[:10]
df_43100_test

Unnamed: 0,NO_KES,NO_RUMAH,POSKOD,LOKALITI,MUKIM,DAERAH,LATITUDE,LONGITUDE,STATUS_LOK,NO_RUMAH_LOKALITI
18,2016/248602,NO 65,47610.0,USJ 2/2,PETALING,PETALING,3.05908,101.59,Bandar,NO 65 USJ 2/2
389,2016/249670,A6-06-01,47610.0,USJ 1 (ANGSANA APPT),PETALING,PETALING,3.05713,101.783,Bandar,A6-06-01 USJ 1 (ANGSANA APPT)
478,2016/249847,NO 44,47610.0,USJ 4/8 - USJ 4/9,PETALING,PETALING,3.04847,101.575,Bandar,NO 44 USJ 4/8 - USJ 4/9
650,2016/250054,NO 37,47610.0,USJ 6/5 - USJ 6/7,PETALING,PETALING,3.04889,101.588,Bandar,NO 37 USJ 6/5 - USJ 6/7
865,2016/250774,NO 8,47610.0,USJ 6/1,PETALING,PETALING,3.05391,101.59,Bandar,NO 8 USJ 6/1
1262,2016/251655,NO 18,47610.0,USJ 6/1,PETALING,PETALING,3.05313,101.59,Bandar,NO 18 USJ 6/1
1742,2016/252712,BLOC V-01-003,47610.0,USJ 6/1 (SUBANG PERDANA GOODYEAR COURT 2),PETALING,PETALING,3.05484,101.588,Bandar,BLOC V-01-003 USJ 6/1 (SUBANG PERDANA GOODYEAR...
2741,2016/254898,01-03-08,47610.0,PJS 9 /1(LAGOON PERDANA & RUMAH KEDAI),PETALING,PETALING,3.06154,101.611,Bandar,01-03-08 PJS 9 /1(LAGOON PERDANA & RUMAH KEDAI)
3002,2016/255237,NO 02-013,47610.0,USJ 6/1 (SUBANG PERDANA GOODYEAR COURT 2),PETALING,PETALING,3.05455,101.588,Bandar,NO 02-013 USJ 6/1 (SUBANG PERDANA GOODYEAR COU...
3050,2016/255691,69,47610.0,USJ 6/2,PETALING,PETALING,3.05067,101.585,Bandar,69 USJ 6/2


In [None]:
import requests

api_key = '###INSERT OPEN CAGE DATA API KEY HERE###'
url = 'https://api.opencagedata.com/geocode/v1/json'

query_list = []
postcode_list = []
latitude_list = []
longitude_list = []

# Geocode each combined address; record "N.A." when no result comes back
for _, row in df_43100_test.iterrows():
    query = row['NO_RUMAH_LOKALITI']
    # params lets requests URL-encode characters such as "/" and "&"
    response = requests.get(url, params={'q': query, 'key': api_key}, timeout=10)
    data = response.json()

    try:
        # Extract latitude and longitude of the top-ranked match
        geometry = data['results'][0]['geometry']
        latitude, longitude = geometry['lat'], geometry['lng']
    except (KeyError, IndexError):
        latitude = "N.A."
        longitude = "N.A."

    query_list.append(query)
    postcode_list.append(row['POSKOD'])
    latitude_list.append(latitude)
    longitude_list.append(longitude)

In [None]:
# form a dataframe to view latitudes and longitudes from opencagedata
df_postcode_test = pd.DataFrame({'Addresses': query_list, "Postcode_dengue": postcode_list, "Latitude_ocd":latitude_list, "Longitude_ocd":longitude_list})
df_postcode_test

Unnamed: 0,Addresses,Postcode_dengue,Latitude_ocd,Longitude_ocd
0,NO 65 USJ 2/2,47610.0,34.665639,135.432453
1,A6-06-01 USJ 1 (ANGSANA APPT),47610.0,47.27541,8.4897
2,NO 44 USJ 4/8 - USJ 4/9,47610.0,3.047883,101.574893
3,NO 37 USJ 6/5 - USJ 6/7,47610.0,3.052684,101.590255
4,NO 8 USJ 6/1,47610.0,3.052196,101.587909
5,NO 18 USJ 6/1,47610.0,3.052196,101.587909
6,BLOC V-01-003 USJ 6/1 (SUBANG PERDANA GOODYEAR...,47610.0,47.27541,8.4897
7,01-03-08 PJS 9 /1(LAGOON PERDANA & RUMAH KEDAI),47610.0,N.A.,N.A.
8,NO 02-013 USJ 6/1 (SUBANG PERDANA GOODYEAR COU...,47610.0,47.27541,8.4897
9,69 USJ 6/2,47610.0,3.059632,101.581486
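
Several coordinates in the table above lie far outside Malaysia (for example latitudes 34.67 and 47.28), which suggests the short address strings are being matched to places in other countries. One possible mitigation is to restrict the search to Malaysia. The sketch below only builds the request URL. It assumes OpenCage's optional `countrycode` and `limit` query parameters behave as described in the OpenCage API documentation, so check that documentation before relying on them. `build_opencage_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

OPENCAGE_ENDPOINT = "https://api.opencagedata.com/geocode/v1/json"

def build_opencage_url(query, api_key, country_code="my", limit=1):
    """Build a URL-encoded OpenCage request restricted to one country."""
    params = {
        "q": query,
        "key": api_key,
        "countrycode": country_code,  # assumption: restrict results to Malaysia
        "limit": limit,               # assumption: only the top match is needed
    }
    return f"{OPENCAGE_ENDPOINT}?{urlencode(params)}"

print(build_opencage_url("NO 65 USJ 2/2", api_key="YOUR_API_KEY"))
```

`urlencode` also escapes the "/" in "USJ 2/2" and the "&" found in some addresses, which would otherwise be read as URL syntax. Appending the district or state to the query (for example from the `DAERAH` column) is another option worth testing.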


In [None]:
df_postcode_test.iloc[0, 0]

'NO 65 USJ 2/2'


1.   Open an account with [Google Cloud Platform](https://cloud.google.com/)
2.   Pricing details are available [here](https://developers.google.com/maps/documentation/geocoding/usage-and-billing)
3.   Enable Geocoding API, details [here](https://developers.google.com/maps/documentation/geocoding/start)



In [None]:
import requests

# Replace the placeholder below with your actual Google API key
api_key = '###INSERT GEOCODING API KEY HERE###'
address = df_postcode_test.iloc[0, 0]
url = 'https://maps.googleapis.com/maps/api/geocode/json'

# Passing params lets requests URL-encode the address
response = requests.get(url, params={'address': address, 'key': api_key}, timeout=10)
data = response.json()

# Extract latitude and longitude
latitude = data['results'][0]['geometry']['location']['lat']
longitude = data['results'][0]['geometry']['location']['lng']
formatted_address = data['results'][0]['formatted_address']

# Extract postal code from address components
address_components = data['results'][0]['address_components']
postcode = None

for component in address_components:
    if 'postal_code' in component['types']:
        postcode = component['long_name']
        break

print(f'Latitude: {latitude}, Longitude: {longitude}')
print(formatted_address)
print(postcode)

Latitude: 3.057089, Longitude: 101.5911209
UEP Subang Jaya, 47600 Subang Jaya, Selangor, Malaysia
47600


In [None]:
import requests

# Replace the placeholder below with your actual Google API key
api_key = '###INSERT GEOCODING API KEY HERE###'
url = 'https://maps.googleapis.com/maps/api/geocode/json'

query_list_gcp = []
latitude_list_gcp = []
longitude_list_gcp = []
formatted_address_gcp = []
postcode_gcp = []

for address in df_postcode_test['Addresses']:
    # params lets requests URL-encode the address
    response = requests.get(url, params={'address': address, 'key': api_key}, timeout=10)
    results = response.json().get('results', [])

    # Defaults cover addresses for which the API returns no match
    latitude = longitude = formatted_address = postcode = None

    if results:
        top = results[0]
        # Extract latitude, longitude and the formatted address
        latitude = top['geometry']['location']['lat']
        longitude = top['geometry']['location']['lng']
        formatted_address = top['formatted_address']

        # Extract postal code from address components
        for component in top['address_components']:
            if 'postal_code' in component['types']:
                postcode = component['long_name']
                break

    query_list_gcp.append(address)
    latitude_list_gcp.append(latitude)
    longitude_list_gcp.append(longitude)
    formatted_address_gcp.append(formatted_address)
    postcode_gcp.append(postcode)

In [None]:
# form a dataframe to view postcode, latitude, longitude, and formatted addresses
df_postcode_test_gcp = pd.DataFrame({'Addresses': query_list_gcp, "Postcode_gcp": postcode_gcp,
                                     "Latitude_gcp":latitude_list_gcp, "Longitude_gcp":longitude_list_gcp, "Addresses_for": formatted_address_gcp})
df_postcode_test_gcp

Unnamed: 0,Addresses,Postcode_gcp,Latitude_gcp,Longitude_gcp,Addresses_for
0,NO 65 USJ 2/2,47600.0,3.057089,101.591121,"UEP Subang Jaya, 47600 Subang Jaya, Selangor, ..."
1,A6-06-01 USJ 1 (ANGSANA APPT),,3.058634,101.593843,"Usj 1, Subang Jaya, Selangor, Malaysia"
2,NO 44 USJ 4/8 - USJ 4/9,47600.0,3.047827,101.575617,"44, Jalan USJ 4/8, Usj 4, 47600 Subang Jaya, S..."
3,NO 37 USJ 6/5 - USJ 6/7,47610.0,3.052812,101.58997,"37, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, S..."
4,NO 8 USJ 6/1,47610.0,3.054503,101.586203,"8, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Se..."
5,NO 18 USJ 6/1,47610.0,3.0544,101.585929,"18, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, S..."
6,BLOC V-01-003 USJ 6/1 (SUBANG PERDANA GOODYEAR...,47610.0,3.053343,101.591644,"N-00-001, Subang Perdana GoodYear Court, 2, Ja..."
7,01-03-08 PJS 9 /1(LAGOON PERDANA & RUMAH KEDAI),47500.0,3.061452,101.610781,"Pangsapuri Lagoon Perdana, Jalan PJS 9/1, Band..."
8,NO 02-013 USJ 6/1 (SUBANG PERDANA GOODYEAR COU...,47610.0,3.053343,101.591644,"N-00-001, Subang Perdana GoodYear Court, 2, Ja..."
9,69 USJ 6/2,47610.0,3.053614,101.58809,"Usj 6, 47610 Subang Jaya, Selangor, Malaysia"


In [None]:
# compare data quality from opencagedata to Google Cloud Platform
df_comparison = pd.merge(df_postcode_test, df_postcode_test_gcp, on="Addresses", how="left")
df_comparison

Unnamed: 0,Addresses,Postcode_dengue,Latitude_ocd,Longitude_ocd,Postcode_gcp,Latitude_gcp,Longitude_gcp,Addresses_for
0,NO 65 USJ 2/2,47610.0,34.665639,135.432453,47600.0,3.057089,101.591121,"UEP Subang Jaya, 47600 Subang Jaya, Selangor, ..."
1,A6-06-01 USJ 1 (ANGSANA APPT),47610.0,47.27541,8.4897,,3.058634,101.593843,"Usj 1, Subang Jaya, Selangor, Malaysia"
2,NO 44 USJ 4/8 - USJ 4/9,47610.0,3.047883,101.574893,47600.0,3.047827,101.575617,"44, Jalan USJ 4/8, Usj 4, 47600 Subang Jaya, S..."
3,NO 37 USJ 6/5 - USJ 6/7,47610.0,3.052684,101.590255,47610.0,3.052812,101.58997,"37, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, S..."
4,NO 8 USJ 6/1,47610.0,3.052196,101.587909,47610.0,3.054503,101.586203,"8, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Se..."
5,NO 18 USJ 6/1,47610.0,3.052196,101.587909,47610.0,3.0544,101.585929,"18, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, S..."
6,BLOC V-01-003 USJ 6/1 (SUBANG PERDANA GOODYEAR...,47610.0,47.27541,8.4897,47610.0,3.053343,101.591644,"N-00-001, Subang Perdana GoodYear Court, 2, Ja..."
7,01-03-08 PJS 9 /1(LAGOON PERDANA & RUMAH KEDAI),47610.0,N.A.,N.A.,47500.0,3.061452,101.610781,"Pangsapuri Lagoon Perdana, Jalan PJS 9/1, Band..."
8,NO 02-013 USJ 6/1 (SUBANG PERDANA GOODYEAR COU...,47610.0,47.27541,8.4897,47610.0,3.053343,101.591644,"N-00-001, Subang Perdana GoodYear Court, 2, Ja..."
9,69 USJ 6/2,47610.0,3.059632,101.581486,47610.0,3.053614,101.58809,"Usj 6, 47610 Subang Jaya, Selangor, Malaysia"
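
To put a number on how far the two geocoders disagree, the great-circle distance between each pair of coordinates can be computed with the haversine formula. The sketch below uses values copied from rows 0 and 2 of the comparison table. `haversine_km` is an illustrative helper, and the Earth radius of 6371 km is the usual mean-radius approximation.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

# (OpenCage lat, lon, Google lat, lon) copied from rows 0 and 2 of the table above
row_0 = (34.665639, 135.432453, 3.057089, 101.591121)
row_2 = (3.047883, 101.574893, 3.047827, 101.575617)

distance_row_0 = haversine_km(*row_0)
distance_row_2 = haversine_km(*row_2)
print(f"Row 0: {distance_row_0:.0f} km apart")
print(f"Row 2: {distance_row_2:.3f} km apart")
```

Rows where the two services land within a few hundred metres of each other are likely to be reliable. Rows that differ by thousands of kilometres indicate that at least one geocoder matched the wrong place. The same function could compare either service against the `LATITUDE` and `LONGITUDE` columns already present in the dengue data.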


## 2:00 pm: Text normalization techniques

### Converting Text to Lowercase
Convert all characters in a text to lowercase.

In [None]:
def to_lowercase(text):
    return text.lower()

text = "Text Normalization Techniques Are Important!"
lowercase_text = to_lowercase(text)
print(lowercase_text)

text normalization techniques are important!


### Expanding Contractions
Expand common contractions (e.g., "don't" to "do not").

In [None]:
!pip install contractions


In [None]:
import contractions

def expand_contractions(text):
    return contractions.fix(text)

text = "I don't know if it's going to work."
expanded_text = expand_contractions(text)
print(expanded_text)

I do not know if it is going to work.


### Handling Special Characters and Numbers
Remove special characters and numbers, keeping only letters and spaces.

In [None]:
import re

def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)

text = "Text normalization: @2024 is a key! #NLP"
cleaned_text = remove_special_characters(text)
print(cleaned_text)

Text normalization  is a key NLP


### Normalizing Whitespace
Normalize whitespace by collapsing multiple spaces into a single space and stripping leading/trailing spaces.

In [None]:
def normalize_whitespace(text):
    return ' '.join(text.split())

text = "  Text    normalization    involves   spaces.   "
normalized_text = normalize_whitespace(text)
print(normalized_text)

Text normalization involves spaces.
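
The steps so far are usually chained rather than applied one at a time. The sketch below combines lowercasing, special-character removal, and whitespace normalization into one function; the name `normalize_text` and the order of the steps are assumptions chosen for illustration.

```python
import re

def normalize_text(text):
    # Lowercase first so the character filter only needs to allow a-z
    text = text.lower()
    # Drop everything except lowercase letters and whitespace
    text = re.sub(r'[^a-z\s]', '', text)
    # Collapse the gaps left behind by removed characters
    return ' '.join(text.split())

print(normalize_text("Text normalization: @2024 is a key! #NLP"))
```

Running whitespace normalization last removes the double space that appears when special characters are stripped on their own.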


### Removing Non-ASCII Characters
Remove non-ASCII characters from the text.

In [None]:
def remove_non_ascii(text):
    return ''.join(char for char in text if ord(char) < 128)

text = "This is a test with non-ASCII characters: ü, ñ, ß."
ascii_text = remove_non_ascii(text)
print(ascii_text)

This is a test with non-ASCII characters: , , .
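
Dropping non-ASCII characters outright loses letters such as ü and ñ. One alternative, sketched here with the standard-library `unicodedata` module, is to decompose accented letters first and keep the base letter. The function name `strip_accents_to_ascii` is an assumption for illustration.

```python
import unicodedata

def strip_accents_to_ascii(text):
    # NFKD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding with errors='ignore' drops the marks and any other non-ASCII character
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents_to_ascii("This is a test with non-ASCII characters: ü, ñ, ß."))
```

Characters without a decomposition, such as ß, are still removed, so this reduces the loss rather than eliminating it.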


### Normalizing Text Using Lemmatization
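Convert words to their base or dictionary form (lemmatization). When no part-of-speech tag is passed, `WordNetLemmatizer` treats every word as a noun, which is why "are" and "jumping" are left unchanged in the output below.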

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

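# Ensure the required NLTK data is downloaded: nltk.download('wordnet') for the lemmatizer
# and nltk.download('punkt') for word_tokenize (newer NLTK releases may ask for 'punkt_tab')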
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    words = word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

text = "The quick brown foxes are jumping over the lazy dogs."
lemmatized_text = lemmatize_text(text)
print(lemmatized_text)

The quick brown fox are jumping over the lazy dog .


## 3:00 pm: Feature extraction for NLP

### Bag of Words (BoW)
Convert a collection of text documents into a matrix of token counts.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I love programming in Python.",
    "Python programming is fun.",
    "I enjoy solving problems with Python."
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()

# Convert the matrix to a DataFrame for better readability
import pandas as pd
df = pd.DataFrame(X.toarray(), columns=feature_names)
df

Unnamed: 0,enjoy,fun,in,is,love,problems,programming,python,solving,with
0,0,0,1,0,1,0,1,1,0,0
1,0,1,0,1,0,0,1,1,0,0
2,1,0,0,0,0,1,0,1,1,1
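
The word "I" does not appear as a column because, with its default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters. The sketch below rebuilds the same count matrix with the standard library to make that tokenization visible; it is an approximation of the defaults, not scikit-learn's implementation.

```python
import re
from collections import Counter

documents = [
    "I love programming in Python.",
    "Python programming is fun.",
    "I enjoy solving problems with Python."
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

counts = [Counter(tokenize(doc)) for doc in documents]
# Sorted vocabulary gives the column order of the matrix
vocabulary = sorted(set(token for c in counts for token in c))
matrix = [[c[token] for token in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```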


### Term Frequency-Inverse Document Frequency (TF-IDF)
Convert text documents into a matrix of TF-IDF features.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "I love programming in Python.",
    "Python programming is fun.",
    "I enjoy solving problems with Python."
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()

# Convert the matrix to a DataFrame for better readability
df = pd.DataFrame(X.toarray(), columns=feature_names)
df

Unnamed: 0,enjoy,fun,in,is,love,problems,programming,python,solving,with
0,0.0,0.0,0.584483,0.0,0.584483,0.0,0.444514,0.345205,0.0,0.0
1,0.0,0.584483,0.0,0.584483,0.0,0.0,0.444514,0.345205,0.0,0.0
2,0.479528,0.0,0.0,0.0,0.0,0.479528,0.0,0.283217,0.479528,0.479528
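
To see where these numbers come from, the sketch below recomputes the first row by hand. It assumes the `TfidfVectorizer` defaults: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper names are made up for illustration.

```python
import math
import re
from collections import Counter

documents = [
    "I love programming in Python.",
    "Python programming is fun.",
    "I enjoy solving problems with Python."
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + freq)) + 1 for term, freq in doc_freq.items()}

def tfidf_row(tokens):
    # Term count times idf, then scale the row to unit L2 norm
    raw = {term: count * idf[term] for term, count in Counter(tokens).items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

row0 = tfidf_row(tokenized[0])
for term in sorted(row0):
    print(f"{term}: {row0[term]:.6f}")
```

"python" appears in every document, so it gets the lowest weight in each row, while terms unique to one document get the highest.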


### Word Embeddings (Word2Vec)
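Train a small Word2Vec model on a toy corpus and use it to represent words as vectors. Note that this example trains the embeddings from scratch rather than loading pre-trained ones, so with only five sentences the similarity scores are unlikely to be meaningful.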

In [None]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

# Sample corpus
corpus = [
    "I love programming in Python.",
    "Python programming is fun and exciting.",
    "I enjoy solving complex problems using Python.",
    "Natural Language Processing with Python is amazing.",
    "Deep learning is a subset of machine learning."
]

# Tokenize the sentences
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

# Train the Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)

# Get the word embedding for the word 'python'
python_vector = model.wv['python']
print(f"Embedding for 'python':\n{python_vector}")

# Find the most similar words to 'python'
similar_words = model.wv.most_similar('python', topn=3)
print("\nMost similar words to 'python':")
for word, similarity in similar_words:
    print(f"{word}: {similarity}")

Embedding for 'python':
[-8.6196875e-03  3.6657380e-03  5.1898835e-03  5.7419385e-03
  7.4669183e-03 -6.1676754e-03  1.1056137e-03  6.0472824e-03
 -2.8400505e-03 -6.1735227e-03 -4.1022300e-04 -8.3689485e-03
 -5.6000124e-03  7.1045388e-03  3.3525396e-03  7.2256695e-03
  6.8002474e-03  7.5307419e-03 -3.7891543e-03 -5.6180597e-04
  2.3483764e-03 -4.5190323e-03  8.3887316e-03 -9.8581640e-03
  6.7646410e-03  2.9144168e-03 -4.9328315e-03  4.3981876e-03
 -1.7395747e-03  6.7113843e-03  9.9648498e-03 -4.3624435e-03
 -5.9933780e-04 -5.6956373e-03  3.8508223e-03  2.7866268e-03
  6.8910765e-03  6.1010956e-03  9.5384968e-03  9.2734173e-03
  7.8980681e-03 -6.9895042e-03 -9.1558648e-03 -3.5575271e-04
 -3.0998408e-03  7.8943167e-03  5.9385742e-03 -1.5456629e-03
  1.5109634e-03  1.7900408e-03  7.8175711e-03 -9.5101865e-03
 -2.0553112e-04  3.4691966e-03 -9.3897223e-04  8.3817719e-03
  9.0107834e-03  6.5365066e-03 -7.1162102e-04  7.7104042e-03
 -8.5343346e-03  3.2071066e-03 -4.6379971e-03 -5.0889552e-03
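
`most_similar` ranks words by the cosine similarity between their vectors. The sketch below computes that measure in plain Python on toy three-dimensional vectors; the vectors are made up for illustration and are not taken from the model above.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: "programming" points the same way as "python", "fun" is orthogonal to it
toy_vectors = {
    "python": [1.0, 2.0, 0.0],
    "programming": [2.0, 4.0, 0.0],
    "fun": [0.0, 0.0, 3.0],
}

for word in ("programming", "fun"):
    print(word, cosine_similarity(toy_vectors["python"], toy_vectors[word]))
```

Because the measure depends only on direction, a vector and a scaled copy of it score close to 1, while orthogonal vectors score 0.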


### Document Embeddings (Doc2Vec)
Use Doc2Vec to obtain vector representations of entire documents.

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Prepare tagged documents
documents = [
    TaggedDocument(words="I love programming in Python".lower().split(), tags=['doc1']),
    TaggedDocument(words="Python programming is fun".lower().split(), tags=['doc2']),
    TaggedDocument(words="I enjoy solving problems with Python".lower().split(), tags=['doc3'])
]

# Train a Doc2Vec model
model = Doc2Vec(vector_size=50, window=2, min_count=1, workers=4)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=10)

# Infer a vector for a tokenized document (tokens are lowercased to match the training vocabulary)
doc1_vector = model.infer_vector(["i", "love", "programming", "in", "python"])
print(doc1_vector)

[-0.00419079  0.00407299  0.00157768  0.00301741  0.00932281  0.00840715
 -0.00456437  0.00577516 -0.00367907 -0.0035626   0.00071434 -0.00419611
  0.00947974  0.00489331  0.00346354  0.00586407 -0.00724905 -0.00139906
 -0.00183649 -0.00937395 -0.00358292  0.00027257 -0.0047359   0.00027171
  0.0057235  -0.00827805 -0.00096316 -0.0023589  -0.00849251 -0.00958847
  0.00296806  0.00493609 -0.00480063 -0.00320215 -0.00356998 -0.00248936
  0.0029669  -0.0092426   0.00114074 -0.00101644  0.00868834 -0.00756774
  0.00199392  0.00767373 -0.00686044  0.00755328  0.00677737 -0.00890561
 -0.00733081  0.0032679 ]


### N-grams
Extract n-grams (e.g., bigrams) from a text document.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I love programming in Python.",
    "Python programming is fun.",
    "I enjoy solving problems with Python."
]

# Create bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()

# Convert the matrix to a DataFrame for better readability
df = pd.DataFrame(X.toarray(), columns=feature_names)
print(df)

   enjoy solving  in python  is fun  love programming  problems with  \
0              0          1       0                 1              0   
1              0          0       1                 0              0   
2              1          0       0                 0              1   

   programming in  programming is  python programming  solving problems  \
0               1               0                   0                 0   
1               0               1                   1                 0   
2               0               0                   0                 1   

   with python  
0            0  
1            0  
2            1  


## 4:00 pm: Practical session 8

In [None]:
# try openstreetmap from https://nominatim.org/release-docs/develop/api/Search/#examples
# https://nominatim.openstreetmap.org/search?q=Unter%20den%20Linden%201%20Berlin&format=json&addressdetails=1&limit=1&polygon_svg=1
df_postcode_test

Unnamed: 0,Addresses,Postcode_dengue,Latitude_ocd,Longitude_ocd
0,NO 65 USJ 2/2,47610.0,34.665639,135.432453
1,A6-06-01 USJ 1 (ANGSANA APPT),47610.0,47.27541,8.4897
2,NO 44 USJ 4/8 - USJ 4/9,47610.0,3.047883,101.574893
3,NO 37 USJ 6/5 - USJ 6/7,47610.0,3.052684,101.590255
4,NO 8 USJ 6/1,47610.0,3.052196,101.587909
5,NO 18 USJ 6/1,47610.0,3.052196,101.587909
6,BLOC V-01-003 USJ 6/1 (SUBANG PERDANA GOODYEAR...,47610.0,47.27541,8.4897
7,01-03-08 PJS 9 /1(LAGOON PERDANA & RUMAH KEDAI),47610.0,N.A.,N.A.
8,NO 02-013 USJ 6/1 (SUBANG PERDANA GOODYEAR COU...,47610.0,47.27541,8.4897
9,69 USJ 6/2,47610.0,3.059632,101.581486


In [None]:
# URL-encode the address so spaces and most special characters are percent-escaped
# (quote() leaves '/' unescaped by default)
from urllib.parse import quote
df_postcode_test['encoded_osm'] = df_postcode_test['Addresses'].apply(quote)
df_postcode_test

Unnamed: 0,Addresses,Postcode_dengue,Latitude_ocd,Longitude_ocd,encoded_osm
0,NO 65 USJ 2/2,47610.0,34.665639,135.432453,NO%2065%20USJ%202/2
1,A6-06-01 USJ 1 (ANGSANA APPT),47610.0,47.27541,8.4897,A6-06-01%20USJ%201%20%28ANGSANA%20APPT%29
2,NO 44 USJ 4/8 - USJ 4/9,47610.0,3.047883,101.574893,NO%2044%20USJ%204/8%20-%20USJ%204/9
3,NO 37 USJ 6/5 - USJ 6/7,47610.0,3.052684,101.590255,NO%2037%20USJ%206/5%20-%20USJ%206/7
4,NO 8 USJ 6/1,47610.0,3.052196,101.587909,NO%208%20USJ%206/1
5,NO 18 USJ 6/1,47610.0,3.052196,101.587909,NO%2018%20USJ%206/1
6,BLOC V-01-003 USJ 6/1 (SUBANG PERDANA GOODYEAR...,47610.0,47.27541,8.4897,BLOC%20V-01-003%20USJ%206/1%20%28SUBANG%20PERD...
7,01-03-08 PJS 9 /1(LAGOON PERDANA & RUMAH KEDAI),47610.0,N.A.,N.A.,01-03-08%20PJS%209%20/1%28LAGOON%20PERDANA%20%...
8,NO 02-013 USJ 6/1 (SUBANG PERDANA GOODYEAR COU...,47610.0,47.27541,8.4897,NO%2002-013%20USJ%206/1%20%28SUBANG%20PERDANA%...
9,69 USJ 6/2,47610.0,3.059632,101.581486,69%20USJ%206/2


In [None]:
df_postcode_test.iloc[2, 4]

'NO%2044%20USJ%204/8%20-%20USJ%204/9'
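
The encoded value above still contains `/` because `quote()` treats it as safe by default. The sketch below compares three standard-library options for the same address; which one suits a given API is a judgment call, and the `limit` and `format` parameters are only there to show how `urlencode()` joins several values.

```python
from urllib.parse import quote, urlencode

address = "NO 44 USJ 4/8 - USJ 4/9"

# Default quote(): '/' is kept because the safe parameter defaults to '/'
default_quoted = quote(address)
# safe='' percent-encodes '/' as well
fully_quoted = quote(address, safe='')
# urlencode() builds a whole query string from a dict of parameters
query_string = urlencode({'q': address, 'format': 'json', 'limit': 1})

print(default_quoted)
print(fully_quoted)
print(query_string)
```

Building the query string from a dictionary avoids assembling the URL by hand, which is where stray `&` or `=` characters in an address tend to cause trouble.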

In [None]:
len(df_postcode_test)

20

In [None]:
import requests

# try another data source: OpenStreetMap (Nominatim)
address = df_postcode_test.iloc[2, 4]
url = f'https://nominatim.openstreetmap.org/search?q={address}&format=json&addressdetails=1&limit=1&polygon_svg=1'

# Define headers with a custom User-Agent
headers = {
    'User-Agent': 'YourAppName/1.0 (your-email@example.com)'  # Replace with your app name and email
}

response = requests.get(url, headers = headers)
data = response.json()
data

[]

In [None]:
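# The search above returned an empty list, so guard the lookup to avoid an IndexError
data[0]["lat"] if data else None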

In [None]:
import requests
import time

query_list_test = []
query_list_osm = []
latitude_list_osm = []
longitude_list_osm = []
formatted_address_osm = []
postcode_osm = []

# Define the headers once; Nominatim's usage policy asks for an identifying User-Agent
headers = {
  'User-Agent': 'YourAppName/1.0 (your-email@example.com)'  # Replace with your app name and email
}

for i in range(0, len(df_postcode_test)):

  address = df_postcode_test.iloc[i, 4]
  url = f'https://nominatim.openstreetmap.org/search?q={address}&format=json&addressdetails=1&limit=1&polygon_svg=1'

  try:
    response = requests.get(url, headers = headers, timeout = 10)
    data = response.json()
    result = data[0]
    # Read every field before appending, so a missing key cannot leave the lists with unequal lengths
    row = (address, result["lat"], result["lon"], result["display_name"], result["address"]["postcode"])
  except (requests.RequestException, ValueError, IndexError, KeyError):
    # The request failed, the response was not JSON, no match was found, or a field was missing
    row = ("N.A.",) * 5

  query_list_osm.append(row[0])
  latitude_list_osm.append(row[1])
  longitude_list_osm.append(row[2])
  formatted_address_osm.append(row[3])
  postcode_osm.append(row[4])

  # Sleep for 2 seconds between requests
  time.sleep(2)

In [None]:
df_postcode_test_osm = pd.DataFrame({'Addresses': df_postcode_test["Addresses"], "Postcode_osm": postcode_osm,
                                     "Latitude_osm":latitude_list_osm, "Longitude_osm":longitude_list_osm, "Addresses_for": formatted_address_osm})
df_postcode_test_osm

Unnamed: 0,Addresses,Postcode_osm,Latitude_osm,Longitude_osm,Addresses_for
0,NO 65 USJ 2/2,47200,3.0605926,101.5831298,"Jalan USJ 2/5E, USJ 2, UEP Subang Jaya, Subang..."
1,A6-06-01 USJ 1 (ANGSANA APPT),N.A.,N.A.,N.A.,N.A.
2,NO 44 USJ 4/8 - USJ 4/9,N.A.,N.A.,N.A.,N.A.
3,NO 37 USJ 6/5 - USJ 6/7,N.A.,N.A.,N.A.,N.A.
4,NO 8 USJ 6/1,47200,3.0475069,101.6010661,"Jalan USJ 1/6, USJ 1, UEP Subang Jaya, Puchong..."
5,NO 18 USJ 6/1,47200,3.0475069,101.6010661,"Jalan USJ 1/6, USJ 1, UEP Subang Jaya, Puchong..."
6,BLOC V-01-003 USJ 6/1 (SUBANG PERDANA GOODYEAR...,N.A.,N.A.,N.A.,N.A.
7,01-03-08 PJS 9 /1(LAGOON PERDANA & RUMAH KEDAI),N.A.,N.A.,N.A.,N.A.
8,NO 02-013 USJ 6/1 (SUBANG PERDANA GOODYEAR COU...,N.A.,N.A.,N.A.,N.A.
9,69 USJ 6/2,47600,3.0533493,101.5618662,"Jalan Pinggiran USJ 2/6, Taman Pinggiran USJ, ..."
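
Parsing the response can be separated from the network call, which makes it testable without contacting the API. The sketch below assumes the response is a list of dictionaries with the `lat`, `lon`, `display_name`, and `address.postcode` fields read by the loop above. The function name `parse_nominatim` and the sample record are hypothetical, and the sample is hand-written rather than a real API response.

```python
def parse_nominatim(data, missing="N.A."):
    # Return (lat, lon, display_name, postcode) from a decoded search response
    if not data:
        return (missing, missing, missing, missing)
    first = data[0]
    return (
        first.get("lat", missing),
        first.get("lon", missing),
        first.get("display_name", missing),
        first.get("address", {}).get("postcode", missing),
    )

# Hand-written sample with the same keys as the fields read by the loop
sample = [{
    "lat": "3.0605926",
    "lon": "101.5831298",
    "display_name": "Example street, example town",
    "address": {"postcode": "47200"},
}]

print(parse_nominatim(sample))
print(parse_nominatim([]))
```

Unlike the loop, this version keeps the coordinates when only the postcode is missing, which may or may not be what the comparison needs.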


In [None]:
# compare results between opencagedata and openstreetmap
df_comparison2 = pd.merge(df_postcode_test, df_postcode_test_osm, on="Addresses", how="left")
df_comparison2

Unnamed: 0,Addresses,Postcode_dengue,Latitude_ocd,Longitude_ocd,encoded_osm,Postcode_osm,Latitude_osm,Longitude_osm,Addresses_for
0,NO 65 USJ 2/2,47610.0,34.665639,135.432453,NO%2065%20USJ%202/2,47200,3.0605926,101.5831298,"Jalan USJ 2/5E, USJ 2, UEP Subang Jaya, Subang..."
1,A6-06-01 USJ 1 (ANGSANA APPT),47610.0,47.27541,8.4897,A6-06-01%20USJ%201%20%28ANGSANA%20APPT%29,N.A.,N.A.,N.A.,N.A.
2,NO 44 USJ 4/8 - USJ 4/9,47610.0,3.047883,101.574893,NO%2044%20USJ%204/8%20-%20USJ%204/9,N.A.,N.A.,N.A.,N.A.
3,NO 37 USJ 6/5 - USJ 6/7,47610.0,3.052684,101.590255,NO%2037%20USJ%206/5%20-%20USJ%206/7,N.A.,N.A.,N.A.,N.A.
4,NO 8 USJ 6/1,47610.0,3.052196,101.587909,NO%208%20USJ%206/1,47200,3.0475069,101.6010661,"Jalan USJ 1/6, USJ 1, UEP Subang Jaya, Puchong..."
5,NO 18 USJ 6/1,47610.0,3.052196,101.587909,NO%2018%20USJ%206/1,47200,3.0475069,101.6010661,"Jalan USJ 1/6, USJ 1, UEP Subang Jaya, Puchong..."
6,BLOC V-01-003 USJ 6/1 (SUBANG PERDANA GOODYEAR...,47610.0,47.27541,8.4897,BLOC%20V-01-003%20USJ%206/1%20%28SUBANG%20PERD...,N.A.,N.A.,N.A.,N.A.
7,01-03-08 PJS 9 /1(LAGOON PERDANA & RUMAH KEDAI),47610.0,N.A.,N.A.,01-03-08%20PJS%209%20/1%28LAGOON%20PERDANA%20%...,N.A.,N.A.,N.A.,N.A.
8,NO 02-013 USJ 6/1 (SUBANG PERDANA GOODYEAR COU...,47610.0,47.27541,8.4897,NO%2002-013%20USJ%206/1%20%28SUBANG%20PERDANA%...,N.A.,N.A.,N.A.,N.A.
9,69 USJ 6/2,47610.0,3.059632,101.581486,69%20USJ%206/2,47600,3.0533493,101.5618662,"Jalan Pinggiran USJ 2/6, Taman Pinggiran USJ, ..."


In [None]:
df_comparison

Unnamed: 0,Addresses,Postcode_dengue,Latitude_ocd,Longitude_ocd,Postcode_gcp,Latitude_gcp,Longitude_gcp,Addresses_for
0,NO 65 USJ 2/2,47610.0,34.665639,135.432453,47600.0,3.057089,101.591121,"UEP Subang Jaya, 47600 Subang Jaya, Selangor, ..."
1,A6-06-01 USJ 1 (ANGSANA APPT),47610.0,47.27541,8.4897,,3.058634,101.593843,"Usj 1, Subang Jaya, Selangor, Malaysia"
2,NO 44 USJ 4/8 - USJ 4/9,47610.0,3.047883,101.574893,47600.0,3.047827,101.575617,"44, Jalan USJ 4/8, Usj 4, 47600 Subang Jaya, S..."
3,NO 37 USJ 6/5 - USJ 6/7,47610.0,3.052684,101.590255,47610.0,3.052812,101.58997,"37, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, S..."
4,NO 8 USJ 6/1,47610.0,3.052196,101.587909,47610.0,3.054503,101.586203,"8, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Se..."
5,NO 18 USJ 6/1,47610.0,3.052196,101.587909,47610.0,3.0544,101.585929,"18, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, S..."
6,BLOC V-01-003 USJ 6/1 (SUBANG PERDANA GOODYEAR...,47610.0,47.27541,8.4897,47610.0,3.053343,101.591644,"N-00-001, Subang Perdana GoodYear Court, 2, Ja..."
7,01-03-08 PJS 9 /1(LAGOON PERDANA & RUMAH KEDAI),47610.0,N.A.,N.A.,47500.0,3.061452,101.610781,"Pangsapuri Lagoon Perdana, Jalan PJS 9/1, Band..."
8,NO 02-013 USJ 6/1 (SUBANG PERDANA GOODYEAR COU...,47610.0,47.27541,8.4897,47610.0,3.053343,101.591644,"N-00-001, Subang Perdana GoodYear Court, 2, Ja..."
9,69 USJ 6/2,47610.0,3.059632,101.581486,47610.0,3.053614,101.58809,"Usj 6, 47610 Subang Jaya, Selangor, Malaysia"
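
Some rows disagree widely between the `_ocd` and `_gcp` columns: row 0 has a latitude of about 34.67 in one and 3.06 in the other. A distance in kilometres makes such mismatches easier to flag than comparing raw coordinates. The sketch below uses the haversine formula with a spherical Earth of radius 6371 km, which is an approximation; the function name `haversine_km` is an assumption for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres on a sphere of radius 6371 km
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Rows 0 and 3 of the table: (Latitude_ocd, Longitude_ocd, Latitude_gcp, Longitude_gcp)
rows = {
    0: (34.665639, 135.432453, 3.057089, 101.591121),
    3: (3.052684, 101.590255, 3.052812, 101.58997),
}

for idx, coords in rows.items():
    print(idx, round(haversine_km(*coords), 2))
```

A threshold on this distance (for example, a few kilometres) could then be used to mark rows where the two geocoders disagree.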


In [None]:
#geocode for missing postcodes
df_missing_postcodes

Unnamed: 0,NO_KES,NO_RUMAH,POSKOD,LOKALITI,MUKIM,DAERAH,LATITUDE,LONGITUDE,STATUS_LOK,NO_RUMAH_LOKALITI
0,2016/246923,"D-302 KIP,",,BL PJU 6 BU 11 : KUATERS GURU,DAMANSARA,PETALING,3.13143,101.608,Bandar,"D-302 KIP, BL PJU 6 BU 11 : KUATERS GURU"
5,2016/248206,LOT 12,,SEKSYEN U19 (KG. PAYA JARAS DALAM),SUNGAI BULOH,PETALING,3.18971,101.544,Bandar,LOT 12 SEKSYEN U19 (KG. PAYA JARAS DALAM)
8,2016/248313,LOT 32840,,SEKSYEN U5 ( TAMAN SEGAR ),BUKIT RAJA,PETALING,3.17433,101.532,Bandar,LOT 32840 SEKSYEN U5 ( TAMAN SEGAR )
21,2016/248607,"NO S2-24A,",,SEKSYEN 7 (PUSAT KOMERSIAL A : JLN AA 7/AA - A...,BUKIT RAJA,PETALING,3.06853,101.491,Bandar,"NO S2-24A, SEKSYEN 7 (PUSAT KOMERSIAL A : JLN ..."
23,2016/248612,NO 47,,SEKSYEN 13 ( TERES B : JLN 13/16 - 13/25 & 13/...,BUKIT RAJA,PETALING,3.08841,101.541,Bandar,NO 47 SEKSYEN 13 ( TERES B : JLN 13/16 - 13/25...
...,...,...,...,...,...,...,...,...,...,...
294000,2022/3553,"NO 8C , JALAN 16 , SELAYANG BARU , 68100 BATU ...",,SELAYANG BARU ZON 15,BATU,GOMBAK,3.24540,101.672,Bandar,"NO 8C , JALAN 16 , SELAYANG BARU , 68100 BATU ..."
294002,2022/3555,(10/20 BLOK TERATAI TAMAN TUN TEJA RAWANG),,APPT TERATAI TTT,RAWANG,GOMBAK,3.29570,101.591,Bandar,(10/20 BLOK TERATAI TAMAN TUN TEJA RAWANG) APP...
294003,2022/3556,"B6-01-01 , APARTMENT PALMA , JALAN DESA RIA ,...",,APPT PALMA BCH,RAWANG,GOMBAK,3.32510,101.536,Bandar,"B6-01-01 , APARTMENT PALMA , JALAN DESA RIA ,..."
294006,2022/3561,"NO.0103 , BLOK SEROJA , TAMAN TUN TEJA .",,APPT SEROJA TTT,RAWANG,GOMBAK,3.29720,101.589,Bandar,"NO.0103 , BLOK SEROJA , TAMAN TUN TEJA . APPT..."


In [None]:
df_missing_postcodes_sample = df_missing_postcodes.iloc[0:30, :]
df_missing_postcodes_sample

Unnamed: 0,NO_KES,NO_RUMAH,POSKOD,LOKALITI,MUKIM,DAERAH,LATITUDE,LONGITUDE,STATUS_LOK,NO_RUMAH_LOKALITI
0,2016/246923,"D-302 KIP,",,BL PJU 6 BU 11 : KUATERS GURU,DAMANSARA,PETALING,3.13143,101.608,Bandar,"D-302 KIP, BL PJU 6 BU 11 : KUATERS GURU"
5,2016/248206,LOT 12,,SEKSYEN U19 (KG. PAYA JARAS DALAM),SUNGAI BULOH,PETALING,3.18971,101.544,Bandar,LOT 12 SEKSYEN U19 (KG. PAYA JARAS DALAM)
8,2016/248313,LOT 32840,,SEKSYEN U5 ( TAMAN SEGAR ),BUKIT RAJA,PETALING,3.17433,101.532,Bandar,LOT 32840 SEKSYEN U5 ( TAMAN SEGAR )
21,2016/248607,"NO S2-24A,",,SEKSYEN 7 (PUSAT KOMERSIAL A : JLN AA 7/AA - A...,BUKIT RAJA,PETALING,3.06853,101.491,Bandar,"NO S2-24A, SEKSYEN 7 (PUSAT KOMERSIAL A : JLN ..."
23,2016/248612,NO 47,,SEKSYEN 13 ( TERES B : JLN 13/16 - 13/25 & 13/...,BUKIT RAJA,PETALING,3.08841,101.541,Bandar,NO 47 SEKSYEN 13 ( TERES B : JLN 13/16 - 13/25...
34,2016/248651,36A,,SEKSYEN 7 (KOLEJ UNISEL & RUMAH KEDAI),BUKIT RAJA,PETALING,3.07671,101.497,Bandar,36A SEKSYEN 7 (KOLEJ UNISEL & RUMAH KEDAI)
35,2016/248652,KOLEJ JASMIN 3DB,,SEKSYEN U10 ( PUNCAK PERDANA : UITM ),BUKIT RAJA,PETALING,3.13409,101.492,Bandar,KOLEJ JASMIN 3DB SEKSYEN U10 ( PUNCAK PERDANA ...
41,2016/248710,"NO 38,",,TMN PUCHONG UTAMA (JLN PU 6),PETALING 2,PETALING,2.99386,101.614,Bandar,"NO 38, TMN PUCHONG UTAMA (JLN PU 6)"
52,2016/248810,NO.7,,USJ 11/4,PETALING,PETALING,3.04797,101.578,Bandar,NO.7 USJ 11/4
59,2016/248819,"LION INDUSTRIAL PARK, LOT 204",,SEKSYEN 26 ( KILANG TIONG NAM ),BUKIT RAJA,PETALING,3.03755,101.545,Bandar,"LION INDUSTRIAL PARK, LOT 204 SEKSYEN 26 ( KIL..."


In [None]:
df_missing_postcodes_sample.head(2)

Unnamed: 0,NO_KES,NO_RUMAH,POSKOD,LOKALITI,MUKIM,DAERAH,LATITUDE,LONGITUDE,STATUS_LOK,NO_RUMAH_LOKALITI
0,2016/246923,"D-302 KIP,",,BL PJU 6 BU 11 : KUATERS GURU,DAMANSARA,PETALING,3.13143,101.608,Bandar,"D-302 KIP, BL PJU 6 BU 11 : KUATERS GURU"
5,2016/248206,LOT 12,,SEKSYEN U19 (KG. PAYA JARAS DALAM),SUNGAI BULOH,PETALING,3.18971,101.544,Bandar,LOT 12 SEKSYEN U19 (KG. PAYA JARAS DALAM)


In [None]:
df_missing_postcodes_sample.iloc[0, 9]

'D-302 KIP, BL PJU 6 BU 11 : KUATERS GURU'

In [None]:
import requests

# get postcodes, coordinates, and formatted addresses for other addresses
query_list_test = []
query_list_osm = []
latitude_list_osm = []
longitude_list_osm = []
formatted_address_osm = []
postcode_osm = []

# Note: the *_osm lists are reused here, but they now hold Google Geocoding results
api_key = '###INSERT GEOCODING API KEY HERE###'
url = 'https://maps.googleapis.com/maps/api/geocode/json'

for i in range(0, len(df_missing_postcodes_sample)):

  address = df_missing_postcodes_sample.iloc[i, 9]

  # Pass the address via params so requests URL-encodes it; in a hand-built URL,
  # an '&' inside an address would end the address parameter early
  response = requests.get(url, params={'address': address, 'key': api_key})
  data = response.json()

  query_list_test.append(address)
  query_list_osm.append(address)
  try:
    # Read all three fields before appending, so the lists stay the same length
    result = data["results"][0]
    location = result["geometry"]["location"]
    lat, lng, formatted = location["lat"], location["lng"], result["formatted_address"]
  except (IndexError, KeyError, TypeError):
    lat, lng, formatted = "N.A.", "N.A.", "N.A."
  latitude_list_osm.append(lat)
  longitude_list_osm.append(lng)
  formatted_address_osm.append(formatted)

  # Retrieve the postcode (stays None when the result has no postal_code component)
  postcode = None
  try:
    for component in data['results'][0]['address_components']:
        if 'postal_code' in component['types']:
            postcode = component['long_name']
  except (IndexError, KeyError, TypeError):
    postcode = "N.A."

  postcode_osm.append(postcode)

In [None]:
df_postcode_test_gcp2 = pd.DataFrame({'Addresses': query_list_test, "Postcode_gcp": postcode_osm,
                                     "Latitude_gcp":latitude_list_osm, "Longitude_gcp":longitude_list_osm, "Addresses_for": formatted_address_osm})
df_postcode_test_gcp2

Unnamed: 0,Addresses,Postcode_gcp,Latitude_gcp,Longitude_gcp,Addresses_for
0,"D-302 KIP, BL PJU 6 BU 11 : KUATERS GURU",47400,3.126222,101.604451,"Bu 11, 47400 Petaling Jaya, Selangor, Malaysia"
1,LOT 12 SEKSYEN U19 (KG. PAYA JARAS DALAM),40100,3.090607,101.529597,"Seksyen 13, 40100 Shah Alam, Selangor, Malaysia"
2,LOT 32840 SEKSYEN U5 ( TAMAN SEGAR ),56100,3.090148,101.741498,"Taman Segar, 56100 Kuala Lumpur, Wilayah Perse..."
3,"NO S2-24A, SEKSYEN 7 (PUSAT KOMERSIAL A : JLN ...",,3.074107,101.492219,"Seksyen 7, Shah Alam, Selangor, Malaysia"
4,NO 47 SEKSYEN 13 ( TERES B : JLN 13/16 - 13/25...,,3.08205,101.538892,"Seksyen 13, Shah Alam, Selangor, Malaysia"
5,36A SEKSYEN 7 (KOLEJ UNISEL & RUMAH KEDAI),,3.074107,101.492219,"Seksyen 7, Shah Alam, Selangor, Malaysia"
6,KOLEJ JASMIN 3DB SEKSYEN U10 ( PUNCAK PERDANA ...,40150,3.13366,101.493738,"Jalan Pulau Indah Au10/A, Puncak Perdana, 4015..."
7,"NO 38, TMN PUCHONG UTAMA (JLN PU 6)",47100,2.991741,101.615601,"Taman Puchong Utama, 47100 Puchong, Selangor, ..."
8,NO.7 USJ 11/4,47620,3.047621,101.579697,"Usj 11, 47620 Subang Jaya, Selangor, Malaysia"
9,"LION INDUSTRIAL PARK, LOT 204 SEKSYEN 26 ( KIL...",40400,3.027183,101.562356,"Seksyen 26, 40400 Shah Alam, Selangor, Malaysia"
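
The empty `Postcode_gcp` cells above are results that contained no `postal_code` component. The postcode lookup can be pulled into a small function and checked against a hand-written record. The sketch below assumes the `address_components`, `types`, and `long_name` keys read by the loop; the function name `extract_postcode` and the sample record are hypothetical, not a real API response.

```python
def extract_postcode(result, missing=None):
    # Return the long_name of the first component typed 'postal_code', else `missing`
    for component in result.get('address_components', []):
        if 'postal_code' in component.get('types', []):
            return component.get('long_name', missing)
    return missing

# Hand-written sample with the same keys as the fields read by the loop
sample_result = {
    'address_components': [
        {'long_name': 'Subang Jaya', 'types': ['locality', 'political']},
        {'long_name': '47620', 'types': ['postal_code']},
    ]
}

print(extract_postcode(sample_result))
print(extract_postcode({'address_components': []}))
```

This version returns the first matching component, whereas the loop keeps the last one it sees; the two agree whenever a result has at most one `postal_code` entry.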


In [None]:
# get the list of tamans (residential neighbourhoods) for postcode 47610
df_postcode_test_gcp3 = df_postcode_test_gcp[df_postcode_test_gcp["Postcode_gcp"] == "47610"]
df_postcode_test_gcp3

Unnamed: 0,Addresses,Postcode_gcp,Latitude_gcp,Longitude_gcp,Addresses_for
3,NO 37 USJ 6/5 - USJ 6/7,47610,3.052812,101.58997,"37, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, S..."
4,NO 8 USJ 6/1,47610,3.054503,101.586203,"8, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Se..."
5,NO 18 USJ 6/1,47610,3.0544,101.585929,"18, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, S..."
6,BLOC V-01-003 USJ 6/1 (SUBANG PERDANA GOODYEAR...,47610,3.053343,101.591644,"N-00-001, Subang Perdana GoodYear Court, 2, Ja..."
8,NO 02-013 USJ 6/1 (SUBANG PERDANA GOODYEAR COU...,47610,3.053343,101.591644,"N-00-001, Subang Perdana GoodYear Court, 2, Ja..."
9,69 USJ 6/2,47610,3.053614,101.58809,"Usj 6, 47610 Subang Jaya, Selangor, Malaysia"
16,NO 23 USJ 6/5 - USJ 6/7,47610,3.052807,101.589808,"23, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, S..."
17,NO 18 USJ 6/2,47610,3.053614,101.58809,"Usj 6, 47610 Subang Jaya, Selangor, Malaysia"
19,NO 11 USJ 6/1,47610,3.054676,101.585738,"11, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, S..."


In [None]:
df_postcode_test_gcp3[["Addresses_for"]]

Unnamed: 0,Addresses_for
3,"37, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, S..."
4,"8, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Se..."
5,"18, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, S..."
6,"N-00-001, Subang Perdana GoodYear Court, 2, Ja..."
8,"N-00-001, Subang Perdana GoodYear Court, 2, Ja..."
9,"Usj 6, 47610 Subang Jaya, Selangor, Malaysia"
16,"23, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, S..."
17,"Usj 6, 47610 Subang Jaya, Selangor, Malaysia"
19,"11, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, S..."


In [None]:
from collections import Counter
import re

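# Pasted copy of the Addresses_for column (row index, tab, address), kept for reference;
# the loop below reads the DataFrame column directly, so this list is unused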
addresses = [
    "3\t37, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "4\t8, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "5\t18, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "6\tN-00-001, Subang Perdana GoodYear Court, 2, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "8\tN-00-001, Subang Perdana GoodYear Court, 2, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "9\tUsj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "16\t23, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "17\tUsj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "19\t11, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia"
]

# Prepare a list to hold all 2-grams
two_grams = []

# Define a function to extract 2-grams
def extract_2grams(address):
    # Tokenize the address into words, removing punctuation
    words = re.findall(r'\b\w+\b', address)
    # Create 2-grams from the list of words
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

# Extract 2-grams from each address and accumulate them
for address in list(df_postcode_test_gcp3["Addresses_for"]):
    two_grams.extend(extract_2grams(address))

# Count occurrences of each 2-gram
two_gram_counts = Counter(two_grams)

# Get the most common 2-gram
most_common_2gram = two_gram_counts.most_common(1)

# Print the most common 2-gram
if most_common_2gram:
    two_gram, count = most_common_2gram[0]
    print(f"Most common 2-gram: {' '.join(two_gram)}: {count}")


Most common 2-gram: Usj 6: 9


In [None]:
from collections import Counter
import re

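# Pasted copy of the Addresses_for column (row index, tab, address), kept for reference;
# the loop below reads the DataFrame column directly, so this list is unused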
addresses = [
    "3\t37, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "4\t8, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "5\t18, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "6\tN-00-001, Subang Perdana GoodYear Court, 2, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "8\tN-00-001, Subang Perdana GoodYear Court, 2, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "9\tUsj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "16\t23, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "17\tUsj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "19\t11, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia"
]

# Function to extract n-grams
def extract_ngrams(address, n):
    # Tokenize the address into words, removing punctuation
    words = re.findall(r'\b\w+\b', address)
    # Create n-grams from the list of words
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Set n for n-grams
n = 3  # Change this value for different n-grams

# Prepare a list to hold all n-grams
n_grams = []

# Extract n-grams from each address and accumulate them
for address in list(df_postcode_test_gcp3["Addresses_for"]):
    n_grams.extend(extract_ngrams(address, n))

# Count occurrences of each n-gram
n_gram_counts = Counter(n_grams)

# Get the most common n-gram
most_common_ngram = n_gram_counts.most_common(1)

# Print the most common n-gram
if most_common_ngram:
    n_gram, count = most_common_ngram[0]
    print(f"Most common {n}-gram: {' '.join(n_gram)}: {count}")


Most common 3-gram: Usj 6 47610: 9
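
`most_common(1)` returns a single entry even when several n-grams share the top count, and among ties it reports whichever was seen first. The sketch below lists every n-gram tied for first place. It uses three of the addresses from the list above as sample input, and the function name `top_ngrams_with_ties` is an assumption for illustration.

```python
import re
from collections import Counter

def extract_ngrams(text, n):
    # Tokenize into words (punctuation removed), then slide a window of size n
    words = re.findall(r'\b\w+\b', text)
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def top_ngrams_with_ties(texts, n):
    # Return the highest count and every n-gram that reaches it
    counts = Counter(gram for text in texts for gram in extract_ngrams(text, n))
    if not counts:
        return 0, []
    best = max(counts.values())
    return best, [gram for gram, count in counts.items() if count == best]

sample_addresses = [
    "37, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "8, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
]

best, tied = top_ngrams_with_ties(sample_addresses, 2)
print(best)
for gram in tied:
    print(' '.join(gram))
```

On this sample, pairs such as "Subang Jaya" and "Selangor Malaysia" tie with "Usj 6", so the top n-gram alone does not single out the neighbourhood name.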
