# Basic Example of Text Extraction using Pipeline

In [18]:
### ~ Let's import some packages ~ ###
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, make_scorer

### Load the data ###
movie_review_raw = pd.read_csv('./moviereviews2.tsv', sep='\t')

This notebook studies a small dataset that comes from a Udemy course. It contains:
* Some basic preprocessing and data cleaning;
* A comparison of different methods for solving a classification problem, using different vectorizers and ML models;
* MORE ?.


## First Step: Data Cleaning and Some Visualisation

### What about missing values?

In [2]:
### Check for NaN values ###
movie_review_raw.isna().sum()

label      0
review    20
dtype: int64

In [3]:
### Check for missing values that are not NaN, such as reviews made of a single space " " ###
movie_review_raw[movie_review_raw['review']==" "]

Unnamed: 0,label,review
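The check above only matches reviews that are exactly one space. As a minimal sketch on made-up data (the `toy` frame below is hypothetical and not part of the dataset), stripping whitespace before comparing also catches empty strings and longer runs of spaces:

```python
import pandas as pd

# Made-up reviews: one missing value, one empty string, one run of spaces.
toy = pd.DataFrame({
    "label": ["pos", "neg", "neg", "pos"],
    "review": ["Great film", None, "", "   "],
})

# Missing values are replaced by "" first, so they are flagged as blank too.
is_blank = toy["review"].fillna("").str.strip() == ""
print(is_blank.tolist())

# Keep only the rows with real text and rebuild a clean index.
cleaned = toy[~is_blank].reset_index(drop=True)
print(cleaned)
```

On the real data, the same mask could be built from `movie_review_raw['review']`.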


##### Delete missing values

In [4]:
### Drop the rows that contain NaN values ###
movie_review_raw.dropna(inplace=True)

### Reset the index ###
movie_review_raw.reset_index(drop=True, inplace=True)

### About our Target

In [25]:
### Label distribution: is the dataset balanced or unbalanced? ###
movie_review_raw.label.value_counts()

neg    3000
pos    3000
Name: label, dtype: int64

## Second Step: Data Processing Before Modelling

In [6]:
movie_review_raw

Unnamed: 0,label,review
0,pos,I loved this movie and will watch it again. Or...
1,pos,"A warm, touching movie that has a fantasy-like..."
2,pos,I was not expecting the powerful filmmaking ex...
3,neg,"This so-called ""documentary"" tries to tell tha..."
4,pos,This show has been my escape from reality for ...
...,...,...
5975,pos,"Of the three remakes of this plot, I like them..."
5976,neg,Poor Whoopi Goldberg. Imagine her at a friend'...
5977,neg,"Honestly before I watched this movie, I had he..."
5978,pos,This movie is essentially shot on a hand held ...


In [7]:
from sklearn.model_selection import train_test_split

### Define the target name and the feature columns ###
target = 'label'
features = [x for x in movie_review_raw.columns.tolist() if x != "label"]

### Split into train and test sets ###
X_train, X_test, y_train, y_test = train_test_split(movie_review_raw[features], movie_review_raw[target],
                                                    test_size=0.33, random_state=42)

X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

In [8]:
### Encode the target: 'pos' -> 1, anything else -> 0 ###
y_train_encoded = y_train.apply(lambda u : 1 if u =='pos' else 0)
y_test_encoded = y_test.apply(lambda u : 1 if u =='pos' else 0)
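An alternative way to write this encoding, shown here as a small sketch on made-up labels, is `Series.map` with an explicit dictionary. With a dictionary, a label that is neither `'pos'` nor `'neg'` becomes NaN instead of being silently encoded as 0:

```python
import pandas as pd

# Made-up labels standing in for y_train / y_test.
labels = pd.Series(["pos", "neg", "pos", "neg"])

# Explicit mapping: each known label gets its own code.
encoded = labels.map({"pos": 1, "neg": 0})
print(encoded.tolist())
```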

## Third Step: Modelling and Scoring

From now on, the goal is to compare:
1. Different vectorisation methods: CountVectorizer, TfidfVectorizer...
2. Different metrics: accuracy, Gini, AUC...

## Creation of different scorers 

A quick reminder about the ROC curve. A binary classifier predicts, for each observation, the probability of belonging to class 0 or to class 1. To turn these probabilities into class labels, you choose a threshold.

For instance, if you are looking at the probability of being in class 0, you can set a threshold such that:
1. If the probability is below this threshold, the predicted value is 1;
2. Otherwise, it is 0.

Each threshold gives a different set of predictions, and therefore a different true positive rate (share of actual positives predicted as positive) and false positive rate (share of actual negatives predicted as positive). Plotting the true positive rate against the false positive rate as the threshold varies gives the **ROC curve**.
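To make the threshold sweep concrete, here is a minimal sketch in plain Python. The labels and scores are made up for illustration, the scores are read as probabilities of class 1, and `roc_points` is a hypothetical helper rather than a library function:

```python
# Toy data: hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

def roc_points(y_true, scores):
    """Return (threshold, FPR, TPR) for each distinct score used as a threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        # Predict class 1 when the score is at or above the threshold.
        preds = [1 if s >= thr else 0 for s in scores]
        tp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 0)
        points.append((thr, fp / n_neg, tp / n_pos))
    return points

for thr, fpr, tpr in roc_points(y_true, scores):
    print(f"threshold={thr:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both rates move from 0 towards 1 as the sweep proceeds.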

##### AUC

AUC is the area under the ROC curve. It ranges from 0 to 1: 0.5 corresponds to a random ranking and 1 to a perfect separation of the two classes.
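The AUC can also be read as the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up data; `pairwise_auc` is a hypothetical helper written for illustration, not the implementation used by scikit-learn:

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, scores)
print(auc)  # 3 of the 4 (positive, negative) pairs are ordered correctly -> 0.75
```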

In [27]:
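### Wrap roc_auc_score so that the AUC is reported as a percentage ###
def auc_metric(y, y_pred, sample_weight=None):
    return (roc_auc_score(y, y_pred, sample_weight=sample_weight)*100)

### Make it a scorer (needs_proba is deprecated since scikit-learn 1.4, ###
### where response_method='predict_proba' replaces it) ###
AUC = make_scorer(auc_metric, greater_is_better=True, needs_proba=True)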

##### Gini

In [29]:
### Gini coefficient derived from the AUC: Gini = 2 * AUC - 1, as a percentage ###
def gini_metric(y, y_pred, sample_weight=None):
    return ((2*roc_auc_score(y, y_pred, sample_weight=sample_weight)-1)*100)

### Make it a scorer (newer scikit-learn versions use response_method instead of needs_proba) ###
Gini = make_scorer(gini_metric, greater_is_better=True, needs_proba=True)
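A scorer is called as `scorer(estimator, X, y_true)`: it computes the predictions itself, so the third argument is always the true labels. The sketch below illustrates this calling convention and the relation Gini = 2 × AUC − 1 on made-up data. It uses the built-in `"roc_auc"` scorer rather than the custom scorers above, only so that it runs on any recent scikit-learn version:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, roc_auc_score

# Toy, made-up data: one numeric feature and a binary target.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 1, 0, 1, 1])

model = LogisticRegression().fit(X, y)

# Built-in scorer, called as scorer(estimator, X, y_true).
auc_scorer = get_scorer("roc_auc")
auc_from_scorer = auc_scorer(model, X, y)

# Same AUC computed by hand from the predicted probability of class 1.
proba_pos = model.predict_proba(X)[:, 1]
auc_by_hand = roc_auc_score(y, proba_pos)

# Gini on the same percentage scale as the Gini scorer above.
gini = (2 * auc_by_hand - 1) * 100
print(auc_from_scorer, auc_by_hand, gini)
```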

##### Trying with TFIDF and SVC

In [24]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


pipe_TFIDF_SVC = Pipeline([('vectorization', TfidfVectorizer()), 
                ('clf_SVC', SVC(probability=True))
                ])

pipe_TFIDF_SVC.fit(X_train[features[0]], y_train_encoded)

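### A scorer is called as scorer(estimator, X, y_true) ###
### Passing y_pred (the 2-D predict_proba output) instead of the true labels raised: ###
### ValueError: continuous-multioutput format is not supported ###
Gini(pipe_TFIDF_SVC, X_test[features[0]], y_test_encoded)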

##### Trying with TFIDF and LR 

In [23]:
pipe_TFIDF_LR = Pipeline([('vectorization', TfidfVectorizer()), 
                ('clf_LR', LogisticRegression())
                ])

pipe_TFIDF_LR.fit(X_train[features[0]], y_train_encoded)

y_pred = pipe_TFIDF_LR.predict_proba(X_test[features[0]])

Gini(pipe_TFIDF_LR, X_test[features[0]], y_test_encoded)



94.11796709551786

##### Trying with TFIDF and XGB

In [90]:
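### Build and fit the TF-IDF + XGBoost pipeline ###
pipe_TFIDF_XGB = Pipeline([('vectorization', TfidfVectorizer()),
                ('clf_XGB', XGBClassifier())
                ])

pipe_TFIDF_XGB.fit(X_train[features[0]], y_train_encoded)

### Pipeline.score needs the true labels; for a classifier it returns the mean accuracy ###
pipe_TFIDF_XGB.score(X_test[features[0]], y_test_encoded)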

0.9219858156028369

In [75]:
vect = TfidfVectorizer()
### fit_transform learns the vocabulary and returns the document-term matrix ###
dtm = vect.fit_transform(X_train['review'])
dtm

<4006x28477 sparse matrix of type '<class 'numpy.float64'>'
	with 437173 stored elements in Compressed Sparse Row format>
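As a small sketch of what the vectorizer does, the example below fits `TfidfVectorizer` on three made-up training sentences and reuses the learned vocabulary on unseen text. The sentences are illustrative and not taken from the dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents standing in for the train and test reviews.
train_docs = ["a great movie", "a terrible movie", "great acting"]
test_docs = ["terrible acting", "unknown words only"]

vect = TfidfVectorizer()

# Learn the vocabulary on the training documents only.
train_dtm = vect.fit_transform(train_docs)

# Reuse that vocabulary on new documents; unseen words are ignored.
test_dtm = vect.transform(test_docs)

print(train_dtm.shape, test_dtm.shape)
# The default token pattern drops single-character tokens such as "a".
print(sorted(vect.vocabulary_))
```

Inside a `Pipeline`, this fit-on-train, transform-on-test split is handled automatically by `fit` and `predict`.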

In [46]:
X_train

Unnamed: 0,review
192,Why do people who do not know what a particula...
4675,"Drum scene is wild! Cook, Jr. is unsung hero o..."
5379,For long time I haven't seen such a good fanta...
4630,Although it got some favorable press after pla...
4983,Not a bad word to say about this film really. ...
...,...
3772,It Could Have Been A Marvelous Story Based On ...
5191,EDDIE MURPHY DELIRIOUS is easily the funniest ...
5226,What a joke. I am watching it on Channel 1 and...
5390,Why does this have such a low rating? I really...


### Task #7: Run predictions and analyze the results

In [None]:
# Form a prediction set


In [None]:
# Report the confusion matrix



In [None]:
# Print a classification report


In [None]:
# Print the overall accuracy
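
The four empty cells above could be filled in along the following lines. This is only a sketch on made-up sentences: `train_texts`, `test_texts` and their labels are hypothetical stand-ins for `X_train['review']`, `X_test['review']`, `y_train_encoded` and `y_test_encoded`, and logistic regression is chosen arbitrarily among the classifiers tried earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

# Made-up data standing in for the train/test reviews and encoded labels.
train_texts = ["loved it", "great film", "wonderful acting",
               "hated it", "terrible film", "awful acting"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great acting", "awful film"]
test_labels = [1, 0]

pipe = Pipeline([("vectorization", TfidfVectorizer()),
                 ("clf", LogisticRegression())])
pipe.fit(train_texts, train_labels)

# Form a prediction set
predictions = pipe.predict(test_texts)

# Report the confusion matrix (rows: true class, columns: predicted class)
cm = confusion_matrix(test_labels, predictions, labels=[0, 1])
print(cm)

# Print a classification report
print(classification_report(test_labels, predictions, labels=[0, 1], zero_division=0))

# Print the overall accuracy
accuracy = accuracy_score(test_labels, predictions)
print(accuracy)
```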


## Great job!