# Assignment 3 - Text Analysis
An explanation this assignment could be found in the .pdf explanation document


## Materials to review for this assignment
<h4>From Moodle:</h4> 
<h5><u>Review the notebooks regarding the following python topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Working with strings</b> (tutorial notebook)<br/>
&#x2714; <b>Text Analysis</b> (tutorial notebook)<br/>
&#x2714; <b>Hebrew text analysis tools (tokenizer, wordnet)</b> (moodle example)<br/>
&#x2714; <b>(brief review) All previous notebooks</b><br/>
</div> 
<h5><u>Review the presentations regarding the following topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Text Analysis</b> (lecture presentation)<br/>
&#x2714; <b>(brief review) All other presentations</b><br/>
</div>

## Personal Details:

In [1]:
# Details Student 1: 
# Name:Tal Yaakobi
# ID: 315788653 
# Mail: taltal2345@gmail.com

## Preceding Step - import modules (packages)
This step is necessary in order to use external modules (packages). <br/>

In [2]:
# --------------------------------------
import pandas as pd
import numpy as np
# --------------------------------------


# --------------------------------------
# ------------- visualizations:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# --------------------------------------


# ---------------------------------------
import sklearn
from sklearn import preprocessing, metrics, pipeline, model_selection, feature_extraction 
from sklearn import naive_bayes, linear_model, svm, neural_network, neighbors, tree
from sklearn import decomposition, cluster

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV 
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# ---------------------------------------


# ----------------- output and visualizations: 
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)
# show several prints in one cell. This will allow us to condence every trick in one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
pd.pandas.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# ---------------------------------------

### Text analysis and String manipulation imports:

In [3]:
# --------------------------------------
# --------- Text analysis and Hebrew text analysis imports:
# vectorizers:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# regular expressions:
import re
# --------------------------------------

### (optional) Hebrew text analysis - WordNet (for Hebrew)
Note: the WordNet is not a must

#### (optional) Only if you didn't install Wordnet (for Hebrew) use:

In [4]:
# word net installation:

# unmark if you want to use and need to install
# !pip install wn
# !python -m wn download omw-he:1.4

In [5]:
# word net import:

# unmark if you want to use:
# import wn

### (optional) Hebrew text analysis - hebrew_tokenizer (Tokenizer for Hebrew)
Note: the hebrew_tokenizer is not a must

#### (optional) Only if you didn't install hebrew_tokenizer use:

In [6]:
# Hebrew tokenizer installation:

# unmark if you want to use and need to install:
# !pip install hebrew_tokenizer

In [7]:
# Hebrew tokenizer import:

# unmark if you want to use:
import hebrew_tokenizer as ht

C:\Users\talta\OneDrive - Open University of Israel\שולחן העבודה\אחסון\לימודים טל\מדעי המחשב\שנה ב\סמסטר ג\למידת מכונה\מטלה 3\assignment3-text_analysis\assignment3-text_analysis-new new


### Reading input files
Reading input files for train annotated corpus (raw text data) corpus and for the test corpus

In [8]:
train_filename = 'annotated_corpus_for_train.csv'
test_filename  = 'corpus_for_test.csv'
df_train = pd.read_csv(train_filename, index_col=None, encoding='utf-8')
df_test  = pd.read_csv(test_filename, index_col=None, encoding='utf-8')

In [9]:
df_train.head(8)
df_train.shape

Unnamed: 0,story,gender
0,"כשחבר הזמין אותי לחול, לא באמת חשבתי שזה יקרה,...",m
1,לפני שהתגייסתי לצבא עשיתי כל מני מיונים ליחידו...,m
2,מאז שהתחילו הלימודים חלומו של כל סטודנט זה הפנ...,f
3,"כשהייתי ילד, מטוסים היה הדבר שהכי ריתק אותי. ב...",m
4,‏הייתי מדריכה בכפר נוער ומתאם הכפר היינו צריכי...,f
5,לפני כ3 חודשים טסתי לרומא למשך שבוע. טסתי במטו...,f
6,אני כבר שנתיים נשוי והשנה אני ואישתי סוף סוף י...,m
7,השנה התחלנו שיפוץ בדירה שלנו בתל אביב. הדירה ה...,f


(753, 2)

In [10]:
df_test.head(3)
df_test.shape

Unnamed: 0,test_example_id,story
0,0,כל קיץ אני והמשפחה נוסעים לארצות הברית לוס אנג...
1,1,"הגעתי לשירות המדינה אחרי שנתיים כפעיל בתנועת ""..."
2,2,אחת האהבות הגדולות שלי אלו הכלבים שלי ושל אישת...


(323, 2)

### Your implementation:
Write your code solution in the following code-cells

## Removal of Numbers and Special Characters: 
To clean the text data, we remove numbers and special characters that may not provide valuable information to the models.

In [11]:
def remove_numbers_and_special_characters(text):
    return re.sub(r'[^א-ת\s]', '', text)

## Data Preprocessing:
The project begins with data preprocessing, where the provided dataset containing stories in Hebrew is prepared for machine learning tasks.
The following preprocessing steps are performed: 
- Tokenization: The text data is tokenized using a Hebrew tokenizer (assumed to be in the 'ht' object) to break it into individual words or tokens.
- Removal of Numbers and Special Characters: Numbers and special characters that may not provide valuable information to the models are removed from the text data.
- Vectorization: The tokenized and cleaned text data is transformed into numerical feature vectors using the CountVectorizer from scikit-learn. This process involves converting words into columns and counting their occurrences in each document, creating a matrix suitable for machine learning.

In [12]:
df_train['gender'] = df_train['gender'].map({'f': 0, 'm': 1})

In [13]:
def preprocess_data(df_train, df_test):
    # Tokenize the stories
    for i, row in df_train.iterrows():
        tokens = ht.tokenize(row['story'])  # Tokenize the story
        row['story'] = ' '.join(token for grp, token, _, _ in tokens)  # Join tokens into a string

    # Replace the 'story' column in the test DataFrame with tokens
    for i, row in df_test.iterrows():
        tokens = ht.tokenize(row['story'])  # Tokenize the story
        row['story'] = ' '.join(token for grp, token, _, _ in tokens)  # Join tokens into a string

    # Remove numbers and special characters
    df_train['story'] = df_train['story'].apply(remove_numbers_and_special_characters)
    df_test['story'] = df_test['story'].apply(remove_numbers_and_special_characters)

    # Create the feature vectors
    X_train = df_train['story']
    y_train = df_train['gender']
    X_test = df_test['story']
    
    # Using CountVectorizer
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)

    return X_train, y_train, X_test

## Model Selection and Evaluation:
- Multiple machine learning models are considered for the classification task, including Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Decision Trees, Naive Bayes (MultinomialNB), and Logistic Regression.
- For each model, a grid search is performed to identify the best hyperparameters using cross-validation. The chosen metric for optimization is the F1 score with macro averaging, which considers precision and recall across classes.
- To evaluate model performance, the F1 score is calculated separately for male and female authors. A macro average of these scores is computed to provide an overall assessment of the model's ability to classify authors by gender.
- The model with the highest macro-averaged F1 score is selected as the best model for the given classification task.

In [14]:
def get_best_model(model, **parameters):
    grid_search = GridSearchCV(model, parameters, scoring="f1_macro", cv=10)
    grid_search.fit(X_train, y_train)
    macro_f1 = grid_search.best_score_
    return grid_search.best_estimator_, macro_f1

In [15]:
# Preprocess the data
X_train, y_train, X_test = preprocess_data(df_train, df_test)

In [16]:
X_train.shape
y_train.shape
X_test.shape

(753, 34933)

(753,)

(323, 34933)

In [17]:
# Define models:

# LogisticRegression
logistic_regression = LogisticRegression(solver="lbfgs")
logistic_regression_params = {
    'C': [0.01, 0.1, 1.0, 10.0],
    'penalty': ['l1', 'l2'],
    'max_iter': [100, 200, 300]
}

# KNeighborsClassifier
knn_classifier = KNeighborsClassifier()
knn_classifier_params = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance']
}

# LinearSVC
linear_svc = LinearSVC()
linear_svc_params = {
    'C': [0.1, 1.0, 10.0],
    'max_iter': [100, 200, 300]
}

# DecisionTreeClassifier
decision_tree_classifier = DecisionTreeClassifier()
decision_tree_classifier_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# MultinomialNB
multinomial_nb = MultinomialNB()
multinomial_nb_params = {
    'alpha': [0.1, 0.5, 1.0]
}

In [18]:
models = [
    ("Logistic Regression", logistic_regression, logistic_regression_params),
    ("K-Nearest Neighbors", knn_classifier, knn_classifier_params),
    ("LinearSVC", linear_svc, linear_svc_params),
    ("Decision Trees", decision_tree_classifier, decision_tree_classifier_params),
    ("MultinomialNB", multinomial_nb, multinomial_nb_params)
]

## Final Model and Prediction:
- After model selection, the best model is chosen based on its macro-averaged F1 score.
- The selected model is trained on the preprocessed training data.
- The trained model is then used to make predictions on the test dataset, which contains stories without gender labels.
- The predictions can be used to classify the authors of these stories into male or female categories.

In [19]:
best_model = None
best_macro_f1 = -np.inf

for model_name, model, parameters_model in models:
    print(model_name)
    best_model_tmp, best_macro_f1_tmp = get_best_model(model, **parameters_model)
    
    if best_macro_f1_tmp > best_macro_f1:
        best_model = best_model_tmp
        best_macro_f1 = best_macro_f1_tmp

print("Best model:", best_model)
print("Best macro_f1:", best_macro_f1)

Logistic Regression
K-Nearest Neighbors
LinearSVC
Decision Trees
MultinomialNB
Best model: LinearSVC(C=10.0, max_iter=200)
Best macro_f1: 0.6845574973637392


In [20]:
predicted_genders = best_model.predict(X_test)
df_test['predicted_category'] = predicted_genders
df_test['predicted_category'] = df_test['predicted_category'].replace({0: 'f', 1: 'm'})
df_test.drop(columns='story', inplace=True)

In [21]:
df_test.head()
df_test.tail()

Unnamed: 0,test_example_id,predicted_category
0,0,m
1,1,m
2,2,m
3,3,m
4,4,f


Unnamed: 0,test_example_id,predicted_category
318,318,m
319,319,m
320,320,m
321,321,m
322,322,m


### Save output to csv (optional)
After you're done save your output to the 'classification_results.csv' csv file.<br/>
We assume that the dataframe with your results contain the following columns:
* column 1 (left column): 'test_example_id'  - the same id associated to each of the test stories to be predicted.
* column 2 (right column): 'predicted_category' - the predicted gender value for each of the associated story. 

Assuming your predicted values are in the `df_predicted` dataframe, you should save you're results as following:

In [22]:
df_test.to_csv('classification_results.csv',index=False)