
First, we import the Python libraries that we will need for the classification.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

Now we read our Excel file into a DataFrame (a DataFrame is a two-dimensional data structure in which data is arranged in a table of rows and columns).
I have prepared a file with 60 sentences: 20 in Polish, 20 in English and 20 in Spanish. The task for the classifier is to learn to assign each sentence to the correct language.

In [None]:
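# Reading .xlsx files with pandas requires the openpyxl package.
df = pd.read_excel(r'https://github.com/kristhina/ml_tutorial/blob/master/data.xlsx?raw=true')
X = df['name']      # the sentences
y = df['category']  # the language of each sentence
df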

We have to split our dataset into a training set and a test set. In the case below, the test set will contain 30% of the data, and `random_state` fixes the shuffle so that the split is reproducible.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)
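
To make the 30% figure concrete, here is a minimal sketch on made-up data of the same shape as the tutorial's file (60 items, 20 per language). The `sentences` and `labels` lists are hypothetical stand-ins, and the `stratify` argument is an optional addition that the cell above does not use; it keeps the three languages equally represented in both sets.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the tutorial's data: 60 items, 20 per language.
sentences = [f"sentence {i}" for i in range(60)]
labels = ["polish"] * 20 + ["english"] * 20 + ["spanish"] * 20

# stratify=labels is an assumption added here, not part of the original cell.
X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.3, random_state=4, stratify=labels
)

print(len(X_train), len(X_test))  # 42 18
print(Counter(y_test))            # 6 sentences per language in the test set
```

Without `stratify`, the split is purely random, so a test set this small may end up with noticeably more sentences from one language than another.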

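In order to classify text, you have to vectorize it. `CountVectorizer` builds a vocabulary from the training sentences and represents every sentence as a vector of word counts: 0 if a word does not occur in the sentence, otherwise the number of times it occurs. (Passing `binary=True` would give a strict 0/1 encoding instead.)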

In [None]:
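vectorizer = CountVectorizer()
vectorizer.fit(X_train)                  # learn the vocabulary from the training sentences only
X_train = vectorizer.transform(X_train)  # turn each sentence into a vector of word counts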

Now we choose the classifier, logistic regression, and train it on the vectorized training sentences (`X_train`) and their labels (`y_train`).

In [None]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

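Now we are ready to test our classifier on the test sentences. They have to be vectorized with the same vectorizer that was fitted on the training data. For each sentence we also keep the highest class probability, which is the probability of the predicted language.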

In [None]:
X_test_v = vectorizer.transform(X_test)
y_predicted = clf.predict(X_test_v)
max_probability = clf.predict_proba(X_test_v).max(axis=1)
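
The relationship between `predict` and `predict_proba` can be sketched on a tiny example. The training sentences below are hypothetical, and only two languages are used to keep the output short. Each column of `predict_proba` corresponds to one entry of `clf.classes_`, and the predicted label is the class with the largest probability.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data, not the tutorial's file.
train_sentences = ["the cat sleeps", "the dog runs", "el gato duerme", "el perro corre"]
train_labels = ["english", "english", "spanish", "spanish"]

vectorizer = CountVectorizer()
clf = LogisticRegression()
clf.fit(vectorizer.fit_transform(train_sentences), train_labels)

new_sentence = vectorizer.transform(["the cat runs"])
probabilities = clf.predict_proba(new_sentence)  # shape: (1 sentence, 2 classes)

print(clf.classes_)    # column order of predict_proba
print(probabilities)   # one row per sentence; each row sums to 1

# Picking the column with the largest probability reproduces clf.predict.
best_column = probabilities.argmax(axis=1)
predicted = clf.classes_[best_column]
print(predicted, probabilities.max(axis=1))
```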

We can collect all the results in one DataFrame that shows each sentence, its correct language, the predicted language and the probability of the prediction.

In [None]:
df_results = pd.DataFrame({'sentence' : X_test, 'correct category' : y_test, 'predicted category' : y_predicted, 'probability': max_probability})
df_results

We can also print the classification report, which lists precision, recall and F1-score for each language, together with the overall accuracy.

In [None]:
print(classification_report(y_test, y_predicted))
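
To show where the numbers in the report come from, the sketch below computes precision and recall for one language by hand and compares them with the report. The `y_true` and `y_pred` lists are invented for this example; they are not results from the tutorial's classifier.

```python
from sklearn.metrics import classification_report

# Hypothetical labels, invented for illustration only.
y_true = ["english", "english", "polish", "polish", "spanish", "spanish"]
y_pred = ["english", "polish", "polish", "polish", "spanish", "english"]

def precision_recall(label):
    """Compute precision and recall for one label by counting."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted_pos = sum(p == label for p in y_pred)  # everything predicted as label
    actual_pos = sum(t == label for t in y_true)     # everything that really is label
    return true_pos / predicted_pos, true_pos / actual_pos

precision, recall = precision_recall("polish")
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# The same values, taken from the report as a dictionary.
report = classification_report(y_true, y_pred, output_dict=True)
print(report["polish"])
```

Here three sentences were predicted as Polish but only two of them really are, so precision is 2/3, while both Polish sentences were found, so recall is 1.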

Now it is time to enter our own sentence to classify.

In [None]:
sentence_to_classify = input('Enter the sentence to classify (in Polish, English or Spanish): ')

The vectorizer expects a collection of sentences, so we wrap our sentence in a list and then vectorize it.

In [None]:
# Wrap the single sentence in a list and vectorize it with the fitted vectorizer.
list_to_predict = vectorizer.transform([sentence_to_classify])

And now we are ready to predict using our logistic regression classifier.

In [None]:
predicted = clf.predict(list_to_predict)[0]
predicted_probability = clf.predict_proba(list_to_predict)
best_predicted_probability = round(predicted_probability.max(axis=1)[0],4)

In [None]:
print("You have written the sentence: {}".format(sentence_to_classify))
print("We predict that this sentence is in {} with probability {}.".format(predicted, best_predicted_probability))
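
The vectorize-then-predict steps used above can also be bundled into one object. The following is only a sketch of that alternative, not the approach this notebook takes: it uses scikit-learn's `make_pipeline`, a hypothetical helper function `classify`, and made-up training sentences in two languages.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, not the tutorial's file.
train_sentences = [
    "good morning my friend",
    "where is the station",
    "buenos dias mi amigo",
    "donde esta la estacion",
]
train_labels = ["english", "english", "spanish", "spanish"]

# The pipeline vectorizes raw text and then classifies it.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_sentences, train_labels)

def classify(sentence):
    """Hypothetical helper: return (language, probability) for one sentence."""
    probabilities = model.predict_proba([sentence])[0]
    best = probabilities.argmax()
    return model.classes_[best], round(float(probabilities[best]), 4)

label, probability = classify("where is my friend")
print("We predict that this sentence is in {} with probability {}.".format(label, probability))
```

With a pipeline, raw sentences go in directly, so there is no separate `vectorizer.transform` call to remember at prediction time.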