# Assignment 5 - Text Analysis
An explanation of this assignment can be found in the accompanying .pdf document.


## Materials to review for this assignment
<h4>From Moodle:</h4> 
<h5><u>Review the notebooks regarding the following Python topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Working with strings</b> (tutorial notebook)<br/>
&#x2714; <b>Text Analysis</b> (tutorial notebook)<br/>
&#x2714; <b>Hebrew text analysis tools (tokenizer, WordNet)</b> (Moodle example)<br/>
&#x2714; <b>(brief review) All previous notebooks</b><br/>
</div> 
<h5><u>Review the presentations regarding the following topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Text Analysis</b> (lecture presentation)<br/>
&#x2714; <b>(brief review) All other presentations</b><br/>
</div>

## Preceding Step - import modules (packages)
This step is necessary in order to use external modules (packages). <br/>

In [1]:
# --------------------------------------
import pandas as pd
import numpy as np
# --------------------------------------


# --------------------------------------
# ------------- visualizations:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# --------------------------------------


# ---------------------------------------
import sklearn
from sklearn import preprocessing, metrics, pipeline, model_selection, feature_extraction 
from sklearn import naive_bayes, linear_model, svm, neural_network, neighbors, tree
from sklearn import decomposition, cluster

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV 
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# ---------------------------------------


# ----------------- output and visualizations: 
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)
# Show the output of every expression in a cell, not only the last one. This lets us condense several results into one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# ---------------------------------------

### Text analysis and String manipulation imports:

In [2]:
# --------------------------------------
# --------- Text analysis and Hebrew text analysis imports:
# vectorizers:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import Normalizer

# regular expressions:
import re
# --------------------------------------

### (optional) Hebrew text analysis - WordNet (for Hebrew)
Note: WordNet is optional

#### (optional) Only if you haven't installed WordNet (for Hebrew), use:

In [None]:
# WordNet installation:

# uncomment if you want to use WordNet and need to install it
# !pip install wn
# !python -m wn download omw-he:1.4

In [None]:
# WordNet import:

# needed only if you use WordNet:
import wn

### (optional) Hebrew text analysis - hebrew_tokenizer (Tokenizer for Hebrew)
Note: hebrew_tokenizer is optional

#### (optional) Only if you haven't installed hebrew_tokenizer, use:

In [None]:
# Hebrew tokenizer installation:

# uncomment if you want to use hebrew_tokenizer and need to install it:
# !pip install hebrew_tokenizer

In [3]:
# Hebrew tokenizer import:

# needed only if you use hebrew_tokenizer:
import hebrew_tokenizer as ht



### Reading input files
Read the input files: the annotated training corpus (raw text data) and the test corpus.

In [4]:
train_filename = 'annotated_corpus_for_train.csv'
test_filename  = 'corpus_for_test.csv'
df_train = pd.read_csv(train_filename, index_col=None, encoding='utf-8')
df_test  = pd.read_csv(test_filename, index_col=None, encoding='utf-8')

In [5]:
df_train.head(8)
df_train.shape

Unnamed: 0,story,gender
0,"כשחבר הזמין אותי לחול, לא באמת חשבתי שזה יקרה,...",m
1,לפני שהתגייסתי לצבא עשיתי כל מני מיונים ליחידו...,m
2,מאז שהתחילו הלימודים חלומו של כל סטודנט זה הפנ...,f
3,"כשהייתי ילד, מטוסים היה הדבר שהכי ריתק אותי. ב...",m
4,‏הייתי מדריכה בכפר נוער ומתאם הכפר היינו צריכי...,f
5,לפני כ3 חודשים טסתי לרומא למשך שבוע. טסתי במטו...,f
6,אני כבר שנתיים נשוי והשנה אני ואישתי סוף סוף י...,m
7,השנה התחלנו שיפוץ בדירה שלנו בתל אביב. הדירה ה...,f


(753, 2)

In [6]:
df_test.head(3)
df_test.shape

Unnamed: 0,test_example_id,story
0,0,כל קיץ אני והמשפחה נוסעים לארצות הברית לוס אנג...
1,1,"הגעתי לשירות המדינה אחרי שנתיים כפעיל בתנועת ""..."
2,2,אחת האהבות הגדולות שלי אלו הכלבים שלי ושל אישת...


(323, 2)

### Your implementation:


   ### Yarin Akiva 
   ### 318424660 



Write your code solution in the following code-cells

# Preprocess


In [7]:
def clean_text(text):
    # Remove numbers, punctuation, and extra whitespace
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text
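To illustrate what `clean_text` does, here is a minimal, self-contained sketch that repeats the function and runs it on two made-up strings (the sample strings are assumptions for illustration and are not taken from the corpus). In Python 3, `\w` matches Unicode letters, so Hebrew letters survive the punctuation filter; the underscore also counts as a word character and is kept.

```python
import re

def clean_text(text):
    # Remove numbers, punctuation, and extra whitespace
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Hypothetical sample inputs (not from the assignment corpus)
english_sample = "Hello, world! 123  foo"
hebrew_sample = "שלום, עולם! 2023"

cleaned_english = clean_text(english_sample)
cleaned_hebrew = clean_text(hebrew_sample)

print(cleaned_english)  # Hello world foo
print(cleaned_hebrew)   # שלום עולם
```

Because punctuation is deleted rather than replaced by a space, words joined only by punctuation (for example `a,b`) are merged into one token.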

In [8]:
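def preprocess_data(data):
    """Clean a story and keep only its Hebrew tokens, joined by single spaces."""
    # Clean the text
    data = clean_text(data)

    # Tokenize the text and keep only Hebrew tokens.
    # ht.tokenize yields (group, token, token_number, (start_index, end_index)).
    hebrew_tokens = [
        token
        for grp, token, token_num, (start_index, end_index) in ht.tokenize(data)
        if grp == 'HEBREW'
    ]

    # Joining the list avoids repeated string concatenation and leaves no trailing space
    return ' '.join(hebrew_tokens)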
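The idea of keeping only Hebrew tokens can be sketched without any third-party dependency. The snippet below is only an approximation and does not reproduce `hebrew_tokenizer`: it assumes a Hebrew token is a run of letters in the Unicode range U+05D0 to U+05EA (alef to tav, including final forms), and the names `HEBREW_WORD` and `keep_hebrew_tokens` as well as the sample sentence are hypothetical.

```python
import re

# Assumption: a Hebrew token is a run of letters in U+05D0..U+05EA (alef..tav)
HEBREW_WORD = re.compile(r'[\u05D0-\u05EA]+')

def keep_hebrew_tokens(text):
    """Return the Hebrew-letter runs found in text, joined by single spaces."""
    return ' '.join(HEBREW_WORD.findall(text))

# Hypothetical sample mixing Hebrew, Latin letters, and digits
sample = "טסתי ל Rome בשנת 2019 עם חברים"
filtered = keep_hebrew_tokens(sample)
print(filtered)
```

Latin words and digits are dropped, while every Hebrew run is kept in its original order. A dedicated tokenizer may treat cases such as vowel marks or abbreviations differently.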

In [9]:
df_train['story'] = df_train['story'].apply(preprocess_data)
df_test['story'] = df_test['story'].apply(preprocess_data)

In [10]:
print(df_train['story'])
print(df_test['story'])

0      כשחבר הזמין אותי לחול לא באמת חשבתי שזה יקרה פ...
1      לפני שהתגייסתי לצבא עשיתי כל מני מיונים ליחידו...
2      מאז שהתחילו הלימודים חלומו של כל סטודנט זה הפנ...
3      כשהייתי ילד מטוסים היה הדבר שהכי ריתק אותי בתו...
4      הייתי מדריכה בכפר נוער ומתאם הכפר היינו צריכים...
                             ...                        
748    אז לפני שנה בדיוק טסתי לאמסטרדם עם שני חברים ט...
749    שבוע שעבר העליתי באופן ספונטני רעיון לנסוע עם ...
750    לפני חודש עברנו לדירה בבית שמש בעקבות משפחתי ה...
751    החוויה אותה ארצה לשתף התרחשה לפני כמה חודשים ז...
752    פעם כשהייתי בחו ל בקבולומביה כחלק מהטיול שלי ל...
Name: story, Length: 753, dtype: object
0      כל קיץ אני והמשפחה נוסעים לארצות הברית לוס אנג...
1      הגעתי לשירות המדינה אחרי שנתיים כפעיל בתנועת י...
2      אחת האהבות הגדולות שלי אלו הכלבים שלי ושל אישת...
3      רגע הגיוס לצבא היה הרגע הכי משמעותי עבורי אני ...
4      אני הגעתי לברזיל ישר מקולומביה וגם אני עשיתי ע...
                             ...                

In [11]:
# Convert the target from categorical values to binary values
# 1 == 'm' (male), 0 == 'f' (female)
df_train['gender'] = np.where(df_train['gender'] == 'm',1,0)
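As a small illustration of this encoding, the sketch below applies the same `np.where` mapping to a made-up label array and then reverses it, which is the step needed later to turn numeric predictions back into `'m'`/`'f'`. The array contents are assumptions for illustration.

```python
import numpy as np

# Hypothetical labels in the same format as the 'gender' column
genders = np.array(['m', 'm', 'f', 'm', 'f'])

# Encode: 'm' -> 1, anything else -> 0
encoded = np.where(genders == 'm', 1, 0)

# Decode: 1 -> 'm', 0 -> 'f'
decoded = np.where(encoded == 1, 'm', 'f')

print(encoded)
print(decoded)
```

Note that `np.where` maps every value other than `'m'` to 0, so a misspelled or missing label would silently become the `'f'` class.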

In [12]:
df_train['gender']
df_test.columns

0      1
1      1
2      0
3      1
4      0
      ..
748    1
749    1
750    1
751    0
752    1
Name: gender, Length: 753, dtype: int32

Index(['test_example_id', 'story'], dtype='object')

In [13]:
# Split the training data set into the feature column (story text) and the target label
X_train = df_train['story']
y_train = df_train['gender']

In [14]:
print(y_train)

0      1
1      1
2      0
3      1
4      0
      ..
748    1
749    1
750    1
751    0
752    1
Name: gender, Length: 753, dtype: int32


# Cross Validation

In [16]:
# Define a pipeline and a hyperparameter grid for each model.
# GridSearchCV then finds the best parameters for each model.
models = [
    {
        'name': 'Perceptron',
        'pipeline': Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('norm', Normalizer()),
            ('clf', Perceptron())
        ]),
        'params': {
            'tfidf__ngram_range': [(1,1), (1,2), (1,3)],
            'tfidf__min_df': [1, 2, 3],
            # Note: Perceptron's default penalty is None, so alpha has no effect here
            'clf__alpha': [0.0001, 0.001, 0.01]
        }
    },
    {
        'name': 'SGDClassifier',
        'pipeline': Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('norm', Normalizer()),
            ('clf', SGDClassifier())
        ]),
        'params': {
            'tfidf__max_df': [0.5, 0.75, 1.0],
            'clf__alpha': [0.0001, 0.001, 0.01, 0.1],
            'clf__penalty': ['l1', 'l2', 'elasticnet'],
            'clf__max_iter': [1000, 2000, 3000]
        }
    },
    {
        'name': 'MLPClassifier',
        'pipeline': Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('norm', Normalizer()),
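            ('clf', MLPClassifier(
                solver='lbfgs',
                alpha=0.0001,
                hidden_layer_sizes=(50,),
                max_iter=200,
                # Note: the three settings below apply only to the 'sgd' and 'adam'
                # solvers; with solver='lbfgs' they are ignored
                batch_size=100,
                early_stopping=True,
                validation_fraction=0.1
            ))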
        ]),
        'params': {
            'tfidf__ngram_range': [(1,1), (1,2)],
            'tfidf__min_df': [1,2],
            'clf__alpha': [0.0001, 0.001, 0.01],
            'clf__hidden_layer_sizes': [(50,), (100,)]
        }
    },
    {
        'name': 'LinearSVC',
        'pipeline': Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('norm', Normalizer()),
            ('clf', LinearSVC())
        ]),
        'params': {
            'tfidf__ngram_range': [(1,1), (1,2), (1,3)],
            'tfidf__min_df': [1, 2, 3],
            'clf__C': [0.1, 1, 10]
        }
    }
]

# Iterate over each model, perform cross-validation and print the best parameters
best_model = None
best_f1_score = 0

for model in models:
    grid_search = GridSearchCV(model['pipeline'], model['params'], cv=10, scoring='f1_macro')
    grid_search.fit(X_train, y_train)
    
    print(model['name'])
    print('Best parameters:', grid_search.best_params_)
    print('Best F1 score:', grid_search.best_score_)
    
    if grid_search.best_score_ > best_f1_score:
        best_f1_score = grid_search.best_score_
        best_model = grid_search.best_estimator_

print('Best model:', best_model)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                       ('norm', Normalizer()),
                                       ('clf', Perceptron())]),
             param_grid={'clf__alpha': [0.0001, 0.001, 0.01],
                         'tfidf__min_df': [1, 2, 3],
                         'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)]},
             scoring='f1_macro')

Perceptron
Best parameters: {'clf__alpha': 0.0001, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 3)}
Best F1 score: 0.6895123017228578


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                       ('norm', Normalizer()),
                                       ('clf', SGDClassifier())]),
             param_grid={'clf__alpha': [0.0001, 0.001, 0.01, 0.1],
                         'clf__max_iter': [1000, 2000, 3000],
                         'clf__penalty': ['l1', 'l2', 'elasticnet'],
                         'tfidf__max_df': [0.5, 0.75, 1.0]},
             scoring='f1_macro')

SGDClassifier
Best parameters: {'clf__alpha': 0.0001, 'clf__max_iter': 3000, 'clf__penalty': 'l1', 'tfidf__max_df': 0.75}
Best F1 score: 0.7204172893068881


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                       ('norm', Normalizer()),
                                       ('clf',
                                        MLPClassifier(batch_size=100,
                                                      early_stopping=True,
                                                      hidden_layer_sizes=(50,),
                                                      solver='lbfgs'))]),
             param_grid={'clf__alpha': [0.0001, 0.001, 0.01],
                         'clf__hidden_layer_sizes': [(50,), (100,), (150,)],
                         'tfidf__min_df': [1, 2],
                         'tfidf__ngram_range': [(1, 1), (1, 2)]},
             scoring='f1_macro')

MLPClassifier
Best parameters: {'clf__alpha': 0.0001, 'clf__hidden_layer_sizes': (50,), 'tfidf__min_df': 2, 'tfidf__ngram_range': (1, 1)}
Best F1 score: 0.6665449045617731


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                       ('norm', Normalizer()),
                                       ('clf', LinearSVC())]),
             param_grid={'clf__C': [0.1, 1, 10], 'tfidf__min_df': [1, 2, 3],
                         'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)]},
             scoring='f1_macro')

LinearSVC
Best parameters: {'clf__C': 10, 'tfidf__min_df': 3, 'tfidf__ngram_range': (1, 1)}
Best F1 score: 0.6592812801724213
Best model: Pipeline(steps=[('tfidf', TfidfVectorizer(max_df=0.75)), ('norm', Normalizer()),
                ('clf', SGDClassifier(max_iter=3000, penalty='l1'))])
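Each pipeline above places a `Normalizer` after `TfidfVectorizer`. Since `TfidfVectorizer` uses `norm='l2'` by default, its output rows already have unit length, so the extra step should leave the matrix unchanged. The following sketch checks this on a tiny made-up corpus (the documents are assumptions for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus
docs = ["the cat sat", "the dog sat down", "a bird flew away"]

# TF-IDF with default settings (norm='l2')
X = TfidfVectorizer().fit_transform(docs)

# Euclidean length of every document vector
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

# Apply the L2 Normalizer again and measure the largest change
X_again = Normalizer(norm='l2').fit_transform(X)
max_diff = np.abs((X - X_again).toarray()).max()

print(row_norms)
print(max_diff)
```

If the row norms are all 1 and the difference is numerically zero, the `norm` step can be dropped from the pipelines without changing the results.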


# Best model 

In [24]:
# Define the pipeline
best_model_sgd = Pipeline([
    # Note: max_df=0.73 is overridden by the grid value (0.75) below
    ('tfidf', TfidfVectorizer(max_df=0.73)),
    ('norm', Normalizer(norm='l2')),
    ('clf', SGDClassifier(random_state=42))
])

# Define the parameter grid
params_grid = {
    # Best params from grid search:
    # 'clf__alpha': 0.0001, 'clf__max_iter': 3000, 'clf__penalty': 'l1', 'tfidf__max_df': 0.75
    'clf__alpha': [0.0001],
    'clf__max_iter': [3000],
    'clf__penalty': ['l1'],
    'tfidf__max_df': [0.75]
}

# Perform grid search
grid_search = GridSearchCV(best_model_sgd, params_grid, cv=10, scoring='f1_macro')
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and cross-validation score
print("Best hyperparameters:", grid_search.best_params_)
print("Best f1 score:", grid_search.best_score_)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(max_df=0.73)),
                                       ('norm', Normalizer()),
                                       ('clf',
                                        SGDClassifier(random_state=42))]),
             param_grid={'clf__alpha': [0.0001], 'clf__max_iter': [3000],
                         'clf__penalty': ['l1'], 'tfidf__max_df': [0.75]},
             scoring='f1_macro')

Best hyperparameters: {'clf__alpha': 0.0001, 'clf__max_iter': 3000, 'clf__penalty': 'l1', 'tfidf__max_df': 0.75}
Best f1 score: 0.7140278157228526
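Because the grid above holds a single value for every parameter, the search evaluates exactly one candidate, so its best score is simply the mean cross-validated score of that candidate. The sketch below demonstrates this equivalence on a small made-up corpus; the documents, labels, `cv=3`, and the helper name `build_pipeline` are assumptions chosen only to keep the example short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 12 short documents, two balanced classes
docs = [
    "red apple sweet", "green apple sour", "ripe apple pie",
    "apple juice cold", "apple tree tall", "fresh apple crisp",
    "fast car road", "old car engine", "blue car wheel",
    "car race track", "car door open", "new car shiny",
]
labels = np.array([1] * 6 + [0] * 6)

def build_pipeline():
    # A fixed random_state keeps both runs deterministic
    return Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', SGDClassifier(random_state=42)),
    ])

# Grid search over a single candidate
grid = GridSearchCV(build_pipeline(), {'clf__alpha': [0.0001]},
                    cv=3, scoring='f1_macro')
grid.fit(docs, labels)

# Plain cross-validation of the same configuration
pipeline = build_pipeline().set_params(clf__alpha=0.0001)
cv_mean = cross_val_score(pipeline, docs, labels,
                          cv=3, scoring='f1_macro').mean()

print(grid.best_score_, cv_mean)
```

Both numbers come from the same folds and the same estimator settings, which is why the two approaches report the same score.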


# Average f1 score for best model

In [25]:
# Define the pipeline with the best parameters
SGDClassifier_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_df=0.75, ngram_range=(1,1))),
    ('normalize', preprocessing.Normalizer(norm='l2')),
    ('clf', SGDClassifier(alpha=0.0001, max_iter=3000, penalty='l1', random_state=42))
])

# Calculate the average f1 score using cross-validation
f1_scores = cross_val_score(SGDClassifier_pipeline, X_train, y_train, cv=10, scoring='f1_macro')
avg_f1_score = np.mean(f1_scores)

# Print the average f1 score
print("Average F1 score:", avg_f1_score)

Average F1 score: 0.7140278157228526
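For reference, the `f1_macro` scoring used here is the unweighted mean of the per-class F1 scores. The following dependency-free sketch computes it by hand for a made-up pair of label lists (the data and the function names are assumptions for illustration); scikit-learn's `f1_score(..., average='macro')` computes the same quantity.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score of one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical labels: 1 == 'm', 0 == 'f'
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

score = macro_f1(y_true, y_pred)
print(score)  # 0.625 (class 1 scores 0.75, class 0 scores 0.5)
```

Since both classes count equally, a model that predicts only the majority class is penalized even when its accuracy looks acceptable.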


### Predict on the test set with the best model
### Build the DataFrame with the predictions

In [27]:
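# Fit the best pipeline on the full training set and predict the test set
SGDClassifier_pipeline.fit(X_train, y_train)
y_pred = SGDClassifier_pipeline.predict(df_test["story"])
# Map the binary predictions back to the original labels (1 -> 'm', 0 -> 'f')
y_pred = np.where(y_pred == 1, 'm', 'f')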
pd.DataFrame({'id_test': df_test["test_example_id"], 'y_pred': y_pred}).head(5)
pd.DataFrame({'id_test': df_test["test_example_id"], 'y_pred': y_pred}).tail(5)



# create dataframe with the test results
df_predicted = pd.DataFrame(
    {'test_example_id': df_test['test_example_id'], 'predicted_category': y_pred})
df_predicted

Pipeline(steps=[('tfidf', TfidfVectorizer(max_df=0.75)),
                ('normalize', Normalizer()),
                ('clf',
                 SGDClassifier(max_iter=3000, penalty='l1', random_state=42))])

Unnamed: 0,id_test,y_pred
0,0,m
1,1,m
2,2,m
3,3,m
4,4,m


Unnamed: 0,id_test,y_pred
318,318,m
319,319,m
320,320,m
321,321,m
322,322,m


Unnamed: 0,test_example_id,predicted_category
0,0,m
1,1,m
2,2,m
3,3,m
4,4,m
...,...,...
318,318,m
319,319,m
320,320,m
321,321,m


### Save output to csv (optional)
After you're done, save your output to the 'classification_results.csv' file.<br/>
We assume that the dataframe with your results contains the following columns:
* column 1 (left column): 'test_example_id' - the same id associated with each of the test stories to be predicted.
* column 2 (right column): 'predicted_category' - the predicted gender value for the associated story.

Assuming your predicted values are in the `df_predicted` dataframe, you should save your results as follows:

In [29]:
df_predicted.to_csv('classification_results.csv',index=False)
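To check that `index=False` produces exactly the two expected columns, the output can be written and read back. The sketch below does this in memory with a small made-up dataframe (the rows and the name `df_example` are assumptions for illustration), using `io.StringIO` instead of a file on disk.

```python
import io
import pandas as pd

# Hypothetical results in the expected two-column layout
df_example = pd.DataFrame({
    'test_example_id': [0, 1, 2],
    'predicted_category': ['m', 'f', 'm'],
})

# Write to an in-memory buffer without the index column
buffer = io.StringIO()
df_example.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the CSV text back and inspect its layout
read_back = pd.read_csv(io.StringIO(csv_text))
print(csv_text.splitlines()[0])
print(read_back.shape)
```

Without `index=False`, the row index would be written as an extra unnamed first column, which would break the two-column format described above.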