## 1. Prepare dataset (**Uneditable**)

- Use gdown ([GitHub link](https://github.com/wkentaro/gdown)) to download the large data file.
- Dataset description:
  - This dataset is generated from GEO (a well-known semantic parsing dataset). The target is the relation between a natural-language sentence and one of its logic terms.
  - It contains 5 columns: ['label', '#1 ID', '#2 ID', 'sentence1', 'sentence2']. We use only 'sentence1' and 'sentence2' as the input features and 'label' as the label of each training/testing sample.
  - Size of this dataset: 20400 training samples, 3400 dev samples, and around 9500 private test samples.
- Note: **Uneditable** means that editing the code in this part is not accepted.
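
As a small illustration of the column layout described above, the sketch below builds a toy frame with the same five columns and separates the input features from the label. The name `toy` and its five rows are illustrative only (the values are copied from the first rows of the training preview); this is not part of the required code.

```python
import pandas as pd

# Toy frame with the dataset's five columns (values taken from the first
# rows of the training preview; `toy` is a hypothetical name).
toy = pd.DataFrame({
    "label": [0, 1, 0, 1, 0],
    "#1 ID": ["and", "loc:t", "state:t", "lambda", "argmax"],
    "#2 ID": [0, 0, 0, 0, 0],
    "sentence1": ["and", "loc : t", "state : t", "lambda", "argmax"],
    "sentence2": ["where is c0"] * 5,
})

# Only the two sentence columns are used as input features.
features = toy[["sentence1", "sentence2"]]
# The 'label' column is the prediction target (0 or 1).
labels = toy["label"]

# Count how many samples fall into each class.
label_counts = labels.value_counts().to_dict()
print(label_counts)
```

Checking the class counts this way on the real `train` frame shows how imbalanced the two labels are before any sampling is applied.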


In [None]:
!pip install gdown
!gdown https://drive.google.com/uc?id=1tyfX0kv6qvA14JgmCkI4xU2zu0eaIIma -O data.zip
!unzip ./data.zip

Downloading...
From: https://drive.google.com/uc?id=1tyfX0kv6qvA14JgmCkI4xU2zu0eaIIma
To: /content/data.zip
100% 160k/160k [00:00<00:00, 57.5MB/s]
Archive:  ./data.zip
replace __MACOSX/._geo-data-ml-class? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [None]:
import pandas as pd
import pickle
from scipy.sparse import coo_matrix, hstack

train = pd.read_csv('geo-data-ml-class/train.enfnamepair', sep=',')
dev = pd.read_csv('geo-data-ml-class/dev.enfnamepair', sep=',')
# test = pd.load('geo-data-ml-class/test.csv')

In [None]:
train.columns

Index(['label', '#1 ID', '#2 ID', 'sentence1', 'sentence2'], dtype='object')

In [None]:
train.iloc[0:3100]

| | label | #1 ID | #2 ID | sentence1 | sentence2 |
|---|---|---|---|---|---|
| 0 | 0 | and | 0 | and | where is c0 |
| 1 | 1 | loc:t | 0 | loc : t | where is c0 |
| 2 | 0 | state:t | 0 | state : t | where is c0 |
| 3 | 1 | lambda | 0 | lambda | where is c0 |
| 4 | 0 | argmax | 0 | argmax | where is c0 |
| ... | ... | ... | ... | ... | ... |
| 3095 | 1 | loc:t | 91 | loc : t | what river flow through s0 |
| 3096 | 0 | state:t | 91 | state : t | what river flow through s0 |
| 3097 | 1 | lambda | 91 | lambda | what river flow through s0 |
| 3098 | 0 | argmax | 91 | argmax | what river flow through s0 |


## 2. Preprocessing data (Editable)

- Concatenate the two columns "sentence1" and "sentence2" into a single document.
- Use TF-IDF ([TF-IDF documentation](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)) to convert each document into a vector. Both the training data and the development data are converted into vectors.
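
The two steps above can be sketched on a few toy rows. This is a minimal illustration, assuming scikit-learn's default `TfidfVectorizer` settings; the names `toy`, `docs`, `toy_vectorizer`, and `X` are hypothetical and chosen so they do not overwrite anything in the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy rows in the dataset's format (values copied from the training preview).
toy = pd.DataFrame({
    "sentence1": ["loc : t", "state : t", "lambda"],
    "sentence2": ["where is c0", "where is c0", "what river flow through s0"],
})

# Step 1: one document per row, the logic term followed by the sentence.
docs = toy["sentence1"] + " " + toy["sentence2"]

# Step 2: learn the vocabulary and IDF weights, then vectorize the documents.
# The default tokenizer keeps only tokens of two or more characters,
# so the single-letter "t" is dropped.
toy_vectorizer = TfidfVectorizer().fit(docs)
X = toy_vectorizer.transform(docs)

print(X.shape)                            # (number of documents, vocabulary size)
print(sorted(toy_vectorizer.vocabulary_)) # learned tokens
```

The vectorizer is fitted once and then reused with `transform`, so that development documents are mapped onto the same vocabulary as the training documents.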

In [None]:
## ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ###
## PUSH YOUR CODE HERE (OPTIONAL)                                                                 ###
## *Note: the code flow below is an example; you can delete all of it and write anything here.    ###
## ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ###
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Data cleaning
def func_name_norm(f_name):
    # Note: str.replace works on substrings, so "is", "the" and "are" are also
    # removed from inside longer words, and every "0" becomes a space.
    f_name = f_name.strip().replace(" : ",":").replace(" _ ","_").replace("0"," ")
    f_name = f_name.strip().replace("is"," ").replace("the"," ").replace("are"," ")
    return f_name

# Normalize both text columns column-wise. This avoids chained assignment
# (train['sentence1'][i] = ...), which triggers SettingWithCopyWarning.
# Only `train` is normalized here; `dev` is vectorized as-is.
train['sentence1'] = train['sentence1'].apply(func_name_norm)
train['sentence2'] = train['sentence2'].apply(func_name_norm)

# Rows with label 0
train_label_0 = train.loc[train["label"] == 0]

# Rows with label 1
train_label_1 = train.loc[train["label"] == 1]

# Sampling: draw 2600 label-0 rows and keep every label-1 row.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
smp_label_0 = train_label_0.sample(n=2600)
new_train = pd.concat([train_label_1, smp_label_0])

train_text_concatination = new_train[['sentence1', 'sentence2']].agg(' '.join, axis=1)

# Fit the TF-IDF vocabulary on the training documents only
vectorizer = TfidfVectorizer().fit(train_text_concatination)

def data_frame_to_vector2(df):
    # Concatenate the two sentence columns and apply the fitted vectorizer
    text_concatination = df[['sentence1', 'sentence2']].agg(' '.join, axis=1)
    return vectorizer.transform(text_concatination)


In [None]:
train_data_vector = data_frame_to_vector2(new_train)
dev_data_vector = data_frame_to_vector2(dev)

In [None]:
new_train

| | label | #1 ID | #2 ID | sentence1 | sentence2 |
|---|---|---|---|---|---|
| 1 | 1 | loc:t | 0 | loc:t | where c |
| 3 | 1 | lambda | 0 | lambda | where c |
| 35 | 1 | loc:t | 1 | loc:t | where m |
| 37 | 1 | lambda | 1 | lambda | where m |
| 69 | 1 | loc:t | 2 | loc:t | where c |
| ... | ... | ... | ... | ... | ... |
| 14811 | 0 | the | 435 | | how high highest point in s |
| 12524 | 0 | argmin | 368 | argmin | how long r in mile |
| 12279 | 0 | river:t | 361 | river:t | what largest citi in s |
| 6967 | 0 | or | 204 | or | how mani peopl live in s |


## 3. Build model and train it (Editable)

- Build the model yourself (using the sklearn library or an external library is accepted).
- Train and optimize your model to get the best performance on the development set.
- Finally, save your model's predictions in the variable **pred** (a list of int values, 0 or 1) for evaluation in the next step.
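
One possible shape for this step is sketched below. It is only an illustration under stated assumptions: it swaps in a `LogisticRegression` with `class_weight="balanced"` inside a scikit-learn pipeline, trains on nine toy rows copied from the training preview, and predicts on those same rows. The names `toy_docs`, `toy_labels`, `toy_model`, and `toy_pred` are hypothetical; on the real data the model would be fitted on the training set and applied to `dev`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy documents: "sentence1 sentence2" pairs from the training preview.
toy_docs = [
    "and where is c0",
    "loc : t where is c0",
    "state : t where is c0",
    "lambda where is c0",
    "argmax where is c0",
    "loc : t what river flow through s0",
    "state : t what river flow through s0",
    "lambda what river flow through s0",
    "argmax what river flow through s0",
]
toy_labels = [0, 1, 0, 1, 0, 1, 0, 1, 0]

# TF-IDF features followed by a linear classifier. class_weight="balanced"
# reweights the classes instead of sampling rows (an assumption of this
# sketch, not the approach used in the cell below).
toy_model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
toy_model.fit(toy_docs, toy_labels)

# Convert the NumPy predictions into a plain list of Python ints (0 or 1),
# which is the format expected for the evaluation step.
toy_pred = [int(p) for p in toy_model.predict(toy_docs)]
print(toy_pred)
```

Keeping the vectorizer and classifier in one pipeline means the same text transformation is applied at fit and predict time.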

In [None]:
## ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ###
## PUSH YOUR CODE HERE  (REQUIRED)                                                                ###
## *Note: the code flow below is an example; you can delete all of it and write anything here.    ###
## ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ###

# Build the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV


# Grid-search the split criterion of a random forest (10-fold CV, scored by F1)
params = {'criterion':['gini', 'entropy'],'n_estimators':[100]}
rfc = RandomForestClassifier()
grid = GridSearchCV(rfc, params, cv = 10, scoring="f1")
grid.fit(train_data_vector, new_train['label'])


# Predict the label of development data.
pred = grid.predict(dev_data_vector.toarray())


# The score is evaluated by cross-validation in the next cell.


In [None]:
# grid is a GridSearchCV with scoring="f1", so each fold score below is an F1 score
cv_scores = cross_val_score(grid,train_data_vector,new_train['label'],cv=10)
print('Cross validation score : ',cv_scores)

Cross validation score :  [0.79657388 0.86614173 0.86815416 0.9034749  0.90909091 0.84381339
 0.83991684 0.76623377 0.7887931  0.68505747]


In [None]:
pred[:10] # show first 10 values of the prediction 

array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0])

## 4. Evaluate your model prediction (**Uneditable**)

- Evaluate the predictions stored in **pred** against the development-set labels, reporting precision, recall, and F1 score.
- Note: **Uneditable** means that editing this code block is not accepted.

In [None]:
from sklearn.metrics import *
print(classification_report(dev['label'], pred))
precision_score(dev['label'], pred), recall_score(dev['label'], pred), f1_score(dev['label'], pred, pos_label=1)

              precision    recall  f1-score   support

           0       0.99      0.95      0.97      3097
           1       0.61      0.89      0.73       303

    accuracy                           0.94      3400
   macro avg       0.80      0.92      0.85      3400
weighted avg       0.96      0.94      0.95      3400



(0.6141552511415526, 0.8877887788778878, 0.7260458839406209)
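
As a check on how the three numbers above relate, the sketch below recomputes them from confusion counts for the positive class. The counts (269 true positives, 169 false positives, 34 false negatives) are an assumption: they are not printed by the notebook, only inferred from the reported support of 303 and the reported precision and recall.

```python
# Confusion counts for label 1, inferred from the report above
# (assumption: not printed by the notebook itself).
tp, fp, fn = 269, 169, 34

# Precision: fraction of predicted positives that are correct.
precision = tp / (tp + fp)
# Recall: fraction of actual positives that are found.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

Read this way, the lower precision comes from the 169 label-0 samples predicted as label 1, while only 34 label-1 samples are missed.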