<a href="https://colab.research.google.com/github/atlas-github/nih_time_series_nlp/blob/main/nih_time_series_nlp_day3_start.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 09:00 am: Practical session 6

### Split Data into Training and Testing Sets for Time Series
Because time series observations are ordered, split the data chronologically: train on the earlier observations and test on the later ones. Shuffling before splitting would let information from the future leak into training.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Sample time series data
data = pd.date_range('2020-01-01', periods=100, freq='D')
values = range(100)
df = pd.DataFrame({'date': data, 'value': values})

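# Split the data by time index: first 80% for training, last 20% for testing
# (80/20 is an illustrative choice). shuffle=False keeps the rows in time order.
train, test = train_test_split(df, test_size=0.2, shuffle=False)
print(f"Train ends {train['date'].max().date()}, test starts {test['date'].min().date()}")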


### Time Series Cross-Validation Techniques
Scikit-learn's `TimeSeriesSplit` class performs time series cross-validation. Each fold trains on an expanding block of earlier observations and tests on the block that immediately follows, so the time order is preserved.

In [None]:
from sklearn.model_selection import TimeSeriesSplit

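# Create the splitter; this reuses df from the previous cell
tscv = TimeSeriesSplit(n_splits=3)

# Each fold trains on all earlier rows and tests on the next block
for fold, (train_idx, test_idx) in enumerate(tscv.split(df)):
    print(f"Fold {fold}: train {train_idx[0]}-{train_idx[-1]}, test {test_idx[0]}-{test_idx[-1]}")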



### Rolling Window Cross-Validation
This technique slides a fixed-size training window forward through the series and tests on the observations that immediately follow each window, which suits time series forecasting tasks.

In [None]:


# Example with window size 80
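
A minimal sketch of rolling-window splitting in plain Python, assuming 100 observations, a training window of 80, and a 5-step test horizon. The helper `rolling_window_splits` and its `horizon` and `step` parameters are illustrative names, not a library API.

```python
def rolling_window_splits(n_samples, window_size, horizon=1, step=None):
    """Yield (train_indices, test_indices) for a fixed-size rolling window."""
    step = step or horizon  # by default, move forward by one test block
    start = 0
    while start + window_size + horizon <= n_samples:
        train_idx = list(range(start, start + window_size))
        test_idx = list(range(start + window_size, start + window_size + horizon))
        yield train_idx, test_idx
        start += step

# Assumed setup: 100 observations, window size 80, 5-step test horizon
splits = list(rolling_window_splits(n_samples=100, window_size=80, horizon=5))

for i, (train_idx, test_idx) in enumerate(splits):
    print(f"Window {i}: train {train_idx[0]}-{train_idx[-1]}, "
          f"test {test_idx[0]}-{test_idx[-1]}")
```

Unlike an expanding window, the training set here always holds the same number of observations, so older data drops out as the window advances.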


## 09:45 am: Introduction to NLP data preprocessing

### Tokenization
Tokenize a sentence into words.

In [None]:
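from nltk.tokenize import word_tokenize
import nltk
# Uncomment on first run (recent NLTK releases may also need 'punkt_tab')
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')

# Tokenize a sample sentence into words and punctuation marks
sentence = "Time series and NLP are both covered in this workshop."
print(word_tokenize(sentence))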

### Stopword removal

Remove common stopwords from a list of tokens.

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

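# Ensure you have downloaded stopwords: nltk.download('stopwords')
tokens = word_tokenize("This is a simple example showing stopword removal")
stop_words = set(stopwords.words('english'))
filtered_tokens = [t for t in tokens if t.lower() not in stop_words]
print(filtered_tokens)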


### Stemming
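Apply stemming to reduce words to their stems by stripping suffixes. The result is not always a dictionary word (for example, the Porter stemmer maps "studies" to "studi").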
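
A dependency-free sketch of the suffix-stripping idea behind stemming. The suffix list and the `simple_stem` helper are toy assumptions for illustration and are far cruder than a real stemmer.

```python
# Toy suffix list, checked in order; a real stemmer uses many more rules
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def simple_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["jumping", "jumped", "jumps", "quickly", "cats"]
stems = [simple_stem(w) for w in words]
print(stems)
```

In practice, NLTK's `PorterStemmer` (`from nltk.stem import PorterStemmer`, then `PorterStemmer().stem(word)`) applies a much fuller rule set.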

### Lemmatization

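Apply lemmatization to map each word to its dictionary form (lemma), using vocabulary and part-of-speech information rather than suffix stripping alone.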

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

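# Ensure you have downloaded wordnet: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The cats are chasing the mice")
# Without a part-of-speech argument, lemmatize() treats every token as a noun
print([lemmatizer.lemmatize(t.lower()) for t in tokens])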


### Text Normalization
Convert text to lowercase and remove punctuation.
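
A minimal sketch using only the standard library: lowercase the text, then delete every character listed in `string.punctuation`. The helper name `normalize_text` and the sample sentence are illustrative.

```python
import string

def normalize_text(text):
    """Lowercase the text and delete ASCII punctuation characters."""
    lowered = text.lower()
    # str.maketrans with a third argument maps those characters to None (deletion)
    return lowered.translate(str.maketrans("", "", string.punctuation))

normalized = normalize_text("Hello, World! NLP is FUN.")
print(normalized)
```

Note that `string.punctuation` covers ASCII punctuation only; typographic quotes and similar characters would pass through unchanged.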

## 11:00 am: Text cleaning techniques

### Removing Special Characters and Numbers
Clean a text by removing special characters and numbers, keeping only letters and spaces.
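
One way to do this is a regular expression that deletes everything except ASCII letters and whitespace. This is a sketch; the helper name `keep_letters_and_spaces` and the sample string are assumptions.

```python
import re

def keep_letters_and_spaces(text):
    """Delete every character that is not an ASCII letter or whitespace."""
    return re.sub(r"[^A-Za-z\s]", "", text)

cleaned = keep_letters_and_spaces("Order #123: 2 cases @ Jalan USJ 6/5!")
print(cleaned.split())
```

Deleting characters can leave runs of spaces behind, so this step is usually followed by whitespace normalization. The pattern also drops accented and other non-English letters, which may not be what you want for multilingual text.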

### Handling Case Sensitivity
Convert all text to lowercase to handle case sensitivity.

### Removing Stopwords
Remove common stopwords from a text.

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

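# Ensure you have downloaded stopwords: nltk.download('stopwords')
text = "The quick brown fox jumps over the lazy dog"
stop_words = set(stopwords.words('english'))
print(' '.join(w for w in word_tokenize(text) if w.lower() not in stop_words))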


### Removing Punctuation
Remove punctuation from a text.
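
Whether punctuation is deleted or replaced with a space changes the resulting tokens, especially for hyphenated words. The sketch below compares both options; `PUNCT_RE` and the sample sentence are illustrative.

```python
import re
import string

# Character class matching any single ASCII punctuation character
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")

text = "A well-known example: state-of-the-art!"

# Option 1: delete punctuation (hyphenated words are fused together)
deleted = PUNCT_RE.sub("", text)

# Option 2: replace punctuation with a space, then collapse repeated spaces
spaced = " ".join(PUNCT_RE.sub(" ", text).split())

print(deleted)
print(spaced)
```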

### Combine techniques

In [None]:
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

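# Ensure you have downloaded stopwords: nltk.download('stopwords')
def clean_text(text):
    text = text.lower().translate(str.maketrans('', '', string.punctuation))  # lowercase, strip punctuation
    text = re.sub(r'[^a-z\s]', '', text)  # drop digits and any remaining special characters
    stop_words = set(stopwords.words('english'))
    return ' '.join(t for t in word_tokenize(text) if t not in stop_words)
print(clean_text("Hello!!! The 3 quick brown foxes jumped over 2 lazy dogs."))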


## 12:00 pm: Practical session 7

In [None]:
#install the gdown library
!pip install gdown

In [None]:
#download dengue csv
import pandas as pd
import gdown

# Google Drive file ID of the dengue CSV (replace it to download a different file)
file_id = '1F-faNnQoyhdjbuyEVZPHV1cv_h0h5UDl'
url = f'https://drive.google.com/uc?id={file_id}'

# Download the CSV file
gdown.download(url, 'dengue.csv', quiet=False)

# Read the CSV file into a DataFrame
df_dengue = pd.read_csv('dengue.csv')
df_dengue

In [None]:
#filter to only needed columns
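# .copy() makes the subset independent, so adding columns later does not trigger SettingWithCopyWarning
df_dengue_filtered = df_dengue[["NO_KES", "NO_RUMAH", "POSKOD", "LOKALITI", "MUKIM", "DAERAH", "LATITUDE", "LONGITUDE", "STATUS_LOK"]].copy()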

df_dengue_filtered

In [None]:
#combine NO_RUMAH and LOKALITI


In [None]:
#count records per postcode (value_counts excludes NaN postcodes by default)
postcode_counts = df_dengue_filtered['POSKOD'].value_counts().reset_index()

# Rename the columns
postcode_counts.columns = ['postcode', 'count']

postcode_counts

In [None]:
#how many NaNs in each column
df_dengue_filtered.isna().sum()

In [None]:
import numpy as np

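#filter to only rows with a missing postcode (df_missing_postcode is an illustrative name)
df_missing_postcode = df_dengue_filtered[df_dengue_filtered['POSKOD'].isna()]
df_missing_postcode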


In [None]:
#verify similarity


In [None]:
!pip install folium

In [None]:
import folium
import pandas as pd



In [None]:
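import plotly.graph_objects as go

# df_43100 is not defined earlier in this notebook; it is assumed here to hold the rows
# for postcode 43100 (adjust the comparison if POSKOD is stored as text)
df_43100 = df_dengue_filtered[df_dengue_filtered['POSKOD'] == 43100]
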
fig = go.Figure(go.Scattermapbox(
    lat=df_43100['LATITUDE'],
    lon=df_43100['LONGITUDE'],
    mode='markers',
    marker=go.scattermapbox.Marker(size=9),
    text=df_43100['LOKALITI']  # Display the place name when hovering
))

# Define the layout for the map
fig.update_layout(
    mapbox_style="open-street-map",
    mapbox=dict(
        center=dict(lat=3.1390, lon=101.6869),  # Center the map
        zoom=10
    )
)

# Display the map
fig.show()


In [None]:
#sign up on https://opencagedata.com/api


In [None]:
# form a dataframe to view latitudes and longitudes from opencagedata



1.   Open an account with [Google Cloud Platform](https://cloud.google.com/).
2.   Review the Geocoding API [pricing details](https://developers.google.com/maps/documentation/geocoding/usage-and-billing).
3.   Enable the Geocoding API by following the [getting-started guide](https://developers.google.com/maps/documentation/geocoding/start).



In [None]:
import requests

# Replace YOUR_API_KEY with your actual Google API key


In [None]:
# form a dataframe to view postcodes, latitudes, longitudes, and formatted addresses


In [None]:
# compare data quality from opencagedata to Google Cloud Platform
df_comparison = pd.merge(df_postcode_test, df_postcode_test_gcp, on="Addresses", how="left")
df_comparison

## 02:00 pm: Text normalization techniques

### Converting Text to Lowercase
Convert all characters in a text to lowercase.

### Expanding Contractions
Expand common contractions (e.g., "don't" to "do not").

In [None]:
!pip install contractions
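
A dependency-free sketch of the idea: look up each contraction in a small dictionary and substitute its expansion. The `CONTRACTIONS` mapping and the `expand_contractions` helper are toy assumptions covering only a handful of forms.

```python
import re

# Toy mapping; real contraction lists are much longer
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",   # ambiguous in general: could also mean "it has"
    "i'm": "i am",
}

# One alternation pattern that matches any known contraction, ignoring case
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its (lowercase) expansion."""
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

expanded = expand_contractions("I don't think it's ready, and we can't wait.")
print(expanded)
```

The `contractions` package installed above offers `contractions.fix(text)`, which covers far more forms than this dictionary.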

### Handling Special Characters and Numbers
Remove special characters and numbers, keeping only letters and spaces.

### Normalizing Whitespace
Normalize whitespace by collapsing multiple spaces into a single space and stripping leading/trailing spaces.
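
A short sketch with the standard `re` module. The pattern `\s+` also matches tabs and newlines, so they collapse to a single space as well. The helper name and sample string are illustrative.

```python
import re

def normalize_whitespace(text):
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

result = normalize_whitespace("  Jalan   USJ \t 6/5,\n Subang  Jaya  ")
print(repr(result))
```

An equivalent idiom is `" ".join(text.split())`.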

### Removing Non-ASCII Characters
Remove non-ASCII characters from the text.
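
Two variants, sketched with the standard library: dropping non-ASCII characters outright, or first decomposing accented letters so that their base letters survive. The helper names and the sample string are assumptions.

```python
import unicodedata

def strip_non_ascii(text):
    """Drop every character outside the ASCII range."""
    return text.encode("ascii", "ignore").decode("ascii")

def fold_to_ascii(text):
    """Decompose accented letters (NFKD) first, so the base letter is kept."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# "Café déjà vu costs 5€", written with escapes to avoid encoding ambiguity
sample = "Caf\u00e9 d\u00e9j\u00e0 vu costs 5\u20ac"

stripped = strip_non_ascii(sample)
folded = fold_to_ascii(sample)
print(stripped)
print(folded)
```

Plain stripping damages words that contain accented letters, so the folding variant is usually the safer choice when the text includes names or loanwords.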

### Normalizing Text Using Lemmatization
Convert words to their base or dictionary form (lemmatization).

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

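# Ensure you have downloaded wordnet: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The children were running and eating")
# pos='v' treats every token as a verb, so 'running' becomes 'run' (the default pos is noun)
print([lemmatizer.lemmatize(t.lower(), pos='v') for t in tokens])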


## 03:00 pm: Feature extraction for NLP

### Bag of Words (BoW)
Convert a collection of text documents into a matrix of token counts.
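
A minimal sketch of how the count matrix is built, in plain Python: collect a sorted vocabulary, then count each vocabulary term in each document. The three toy documents and the whitespace tokenizer are assumptions.

```python
from collections import Counter

docs = ["the cat sat", "the cat sat on the mat", "the dog barked"]

# Naive whitespace tokenization (a real pipeline would use a proper tokenizer)
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: every distinct token, in sorted order (one matrix column each)
vocab = sorted({token for doc in tokenized for token in doc})

# One row per document, one count per vocabulary term
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[term] for term in vocab])

print(vocab)
for row in bow:
    print(row)
```

scikit-learn's `CountVectorizer` builds the same kind of matrix in sparse form, which matters once the vocabulary grows large.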

### Term Frequency-Inverse Document Frequency (TF-IDF)
Convert text documents into a matrix of TF-IDF features.
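
A sketch of the textbook weighting, assuming term frequency is the count divided by document length and inverse document frequency is `log(N / df)`. The toy documents and helper names are illustrative.

```python
import math
from collections import Counter

docs = ["the cat sat", "the cat sat on the mat", "the dog barked"]
tokenized = [doc.lower().split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf_weights(doc):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tfidf_weights(doc) for doc in tokenized]
for w in weights:
    print({term: round(value, 3) for term, value in w.items()})
```

A term that occurs in every document, such as "the" here, gets a weight of zero, while rarer terms are weighted up. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this sketch.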

### Word Embeddings (Word2Vec)
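Train a small Word2Vec model on a toy corpus to represent words as dense vectors. With only a few sentences the similarities are not meaningful; in practice you would train on a large corpus or load pre-trained embeddings.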

In [None]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

# Sample corpus
corpus = [
    "I love programming in Python.",
    "Python programming is fun and exciting.",
    "I enjoy solving complex problems using Python.",
    "Natural Language Processing with Python is amazing.",
    "Deep learning is a subset of machine learning."
]

# Tokenize the sentences
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

# Train the Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)

# Get the word embedding for the word 'python'
python_vector = model.wv['python']
print(f"Embedding for 'python':\n{python_vector}")

# Find the most similar words to 'python'
similar_words = model.wv.most_similar('python', topn=3)
print("\nMost similar words to 'python':")
for word, similarity in similar_words:
    print(f"{word}: {similarity}")

### Document Embeddings (Doc2Vec)
Use Doc2Vec to obtain vector representations of entire documents.

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Prepare tagged documents
documents = [
    TaggedDocument(words="I love programming in Python".lower().split(), tags=['doc1']),
    TaggedDocument(words="Python programming is fun".lower().split(), tags=['doc2']),
    TaggedDocument(words="I enjoy solving problems with Python".lower().split(), tags=['doc3'])
]

# Train a Doc2Vec model
model = Doc2Vec(vector_size=50, window=2, min_count=1, workers=4)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=10)

# Infer a vector for a tokenized document; tokens are lowercased to match the training vocabulary
doc1_vector = model.infer_vector(["i", "love", "programming", "in", "python"])
print(doc1_vector)

### N-grams
Extract n-grams (e.g., bigrams) from a text document.
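
A minimal sketch: an n-gram is a run of `n` consecutive tokens, which can be produced by zipping the token list with shifted copies of itself. The helper name `ngrams_from_tokens` and the sample sentence are illustrative.

```python
def ngrams_from_tokens(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest shifted copy, so no partial n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Natural language processing with Python".lower().split()

bigrams = ngrams_from_tokens(tokens, 2)
trigrams = ngrams_from_tokens(tokens, 3)

print(bigrams)
print(trigrams)
```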

## 04:00 pm: Practical session 8

In [None]:
# try openstreetmap from https://nominatim.org/release-docs/develop/api/Search/#examples
# https://nominatim.openstreetmap.org/search?q=Unter%20den%20Linden%201%20Berlin&format=json&addressdetails=1&limit=1&polygon_svg=1


In [None]:
# url encode the address, i.e. no spaces or other special characters
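
A sketch of URL encoding with the standard library, using the sample address from the Nominatim documentation link above. The parameter set is an assumption chosen to mirror that example; no request is sent here.

```python
from urllib.parse import quote, urlencode

address = "Unter den Linden 1 Berlin"

# Percent-encode the address on its own: spaces become %20
encoded_address = quote(address)

# Or build the whole query string from a dict of parameters.
# quote_via=quote encodes spaces as %20 instead of the default '+'.
params = {"q": address, "format": "json", "addressdetails": 1, "limit": 1}
search_url = "https://nominatim.openstreetmap.org/search?" + urlencode(params, quote_via=quote)

print(encoded_address)
print(search_url)
```

When calling the public Nominatim server, its usage policy asks clients to identify themselves (for example with a `User-Agent` header) and to keep request rates low.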


In [None]:
# try another data source: open street map


In [None]:
# compare results between opencagedata and openstreetmap


In [None]:
#geocode for missing postcodes


In [None]:
import requests

# get postcodes, coordinates, and formatted addresses for other addresses


In [None]:
#get list of tamans based on postcode


In [None]:
from collections import Counter
import re

# Your list of addresses
addresses = [
    "3\t37, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "4\t8, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "5\t18, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "6\tN-00-001, Subang Perdana GoodYear Court, 2, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "8\tN-00-001, Subang Perdana GoodYear Court, 2, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "9\tUsj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "16\t23, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "17\tUsj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "19\t11, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia"
]

# Prepare a list to hold all 2-grams
two_grams = []

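# Define a function to extract 2-grams
def extract_two_grams(address):
    # Drop the leading row index before the tab, then split into word tokens
    words = re.findall(r'\b\w+\b', address.split('\t', 1)[-1])
    return list(zip(words, words[1:]))

# Extract 2-grams from each address and accumulate them
for address in addresses:
    two_grams.extend(extract_two_grams(address))

# Show the five most common 2-grams
print(Counter(two_grams).most_common(5))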


In [None]:
from collections import Counter
import re

# Your list of addresses
addresses = [
    "3\t37, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "4\t8, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "5\t18, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "6\tN-00-001, Subang Perdana GoodYear Court, 2, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "8\tN-00-001, Subang Perdana GoodYear Court, 2, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "9\tUsj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "16\t23, Jalan USJ 6/5, Usj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "17\tUsj 6, 47610 Subang Jaya, Selangor, Malaysia",
    "19\t11, Jalan USJ 6/1, Usj 6, 47610 Subang Jaya, Selangor, Malaysia"
]

# Function to extract n-grams
def extract_ngrams(address, n):
    # Tokenize the address into words, removing punctuation
    words = re.findall(r'\b\w+\b', address)
    # Create n-grams from the list of words
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Set n for n-grams
n = 3  # Change this value for different n-grams

# Prepare a list to hold all n-grams
n_grams = []

# Extract n-grams from each address and accumulate them.
# To run on the geocoded results instead, iterate over
# df_postcode_test_gcp3["Addresses_for"] once that DataFrame exists.
for address in addresses:
    n_grams.extend(extract_ngrams(address, n))

# Count occurrences of each n-gram
n_gram_counts = Counter(n_grams)

# Get the most common n-gram
most_common_ngram = n_gram_counts.most_common(1)

# Print the most common n-gram
if most_common_ngram:
    n_gram, count = most_common_ngram[0]
    print(f"Most common {n}-gram: {' '.join(n_gram)}: {count}")
