In [1]:
import pandas as pd
import string

In [2]:
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# Data pre-processing
The comments are made of words, but a model needs numbers, so we will use **CountVectorizer** to convert a collection of text documents into a matrix of token counts.

The resulting matrix has one row per document and one column per unique word in the corpus.

Before tokenizing the words, we first removed the punctuation characters with the following function:
```python
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text
```
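
As a quick check, the function can be applied to one of the example sentences used below. The sketch also shows an equivalent single-pass variant built on `str.translate`; the name `remove_punctuations_fast` is only illustrative and is not part of the original notebook.

```python
import string

def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# Illustrative alternative (hypothetical name): delete every punctuation
# character in a single pass using a translation table.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuations_fast(text):
    return text.translate(_PUNCT_TABLE)

sample = "my age is 32 years, and have been living in CDMX for 32 years also everybody."
cleaned = remove_punctuations(sample)
print(cleaned)
# my age is 32 years and have been living in CDMX for 32 years also everybody
```

Both versions return the same string; the second one simply avoids scanning the text once per punctuation character.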


## Creating the DataFrame

To explain how **CountVectorizer** works, we will use a simple example.

### We'll create the following DataFrame

|   | text                                                                           |
|---|--------------------------------------------------------------------------------|
| 0 | Hello everybody                                                                |
| 1 | my name is Jose Alvarez everybody                                              |
| 2 | my age is 32 years, and have been living in CDMX for 32 years also everybody.  |

We will also remove the punctuation with **remove_punctuations**.

In [3]:
new_comment = [{"text":"Hello everybody"}, {"text":"my name is Jose Alvarez everybody"}, {"text":"my age is 32 years, and have been living in CDMX for 32 years also everybody."}]

In [4]:
new_df = pd.DataFrame(new_comment)
new_df["text"] = new_df["text"].apply(remove_punctuations)

In [5]:
new_df

Unnamed: 0,text
0,Hello everybody
1,my name is Jose Alvarez everybody
2,my age is 32 years and have been living in CDM...


In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

## CountVectorizer

This class tokenizes all of the words in the corpus held in the *DataFrame*.

**CountVectorizer** requires some key arguments that will determine the tokenized output.

```python
CountVectorizer(ngram_range=(1, 1), stop_words='english')
```
***ngram_range*** sets the *range of n-values for different word n-grams*, that is, the minimum and maximum number of consecutive words that are combined into a single token.

A value of `(1, 1)` means that the minimum is one word and the maximum is also one word, so every token is a single word.

A value of `(2, 2)` extracts every unique combination of two consecutive words, which generally makes the array larger.

We also need to remove common English words that add no value to the conversations, which is what the ***stop_words*** argument of **CountVectorizer** does. `'english'` is the only built-in list; for other languages a custom list of words can be passed instead.
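
To make the effect of `ngram_range` concrete, here is a minimal pure-Python sketch of how word n-grams are formed once stop words are dropped. It is not scikit-learn's implementation: the `STOP_WORDS` set is a small hand-picked subset chosen for this example, and `word_ngrams` is a hypothetical helper.

```python
# Hand-picked subset of English stop words, for illustration only;
# scikit-learn's built-in 'english' list is much longer.
STOP_WORDS = {"my", "name", "is", "and", "have", "been", "in", "for", "also"}

def word_ngrams(text, n):
    """Return the n-word combinations left after lowercasing and stop-word removal."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "my name is Jose Alvarez everybody"
print(word_ngrams(sentence, 1))  # ['jose', 'alvarez', 'everybody']
print(word_ngrams(sentence, 2))  # ['jose alvarez', 'alvarez everybody']
```

Because the stop words are removed before the n-grams are built, two words that were separated by a stop word in the original sentence can end up in the same bigram.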

#### Using ngram_range 1,1 or 2,2

In [28]:
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words='english')
vectorizer.fit(new_df['text'].values)
vectorized = vectorizer.transform(new_df['text'].values)
# get_feature_names() was removed in scikit-learn 1.2; on versions < 1.0 keep the old name
listofwords = vectorizer.get_feature_names_out()
singleword = pd.DataFrame(vectorized.toarray(), columns=listofwords)
singleword.head()

Unnamed: 0,32,age,alvarez,cdmx,everybody,hello,jose,living,years
0,0,0,0,0,1,1,0,0,0
1,0,0,1,0,1,0,1,0,0
2,2,1,0,1,1,0,0,1,2


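As we can see, **CountVectorizer** extracts all the unique words and creates an array with one row per document and one column per word, where each cell counts how often the word occurs in that document.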
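
The table above can also be rebuilt by hand. The following sketch uses only the standard library and assumes a small hand-picked stop-word subset rather than scikit-learn's full list; `count_matrix` is a hypothetical helper, not a library function.

```python
from collections import Counter

# Assumption: a tiny stop-word subset that covers these three sentences.
STOP_WORDS = {"my", "name", "is", "and", "have", "been", "in", "for", "also"}

docs = [
    "Hello everybody",
    "my name is Jose Alvarez everybody",
    "my age is 32 years and have been living in CDMX for 32 years also everybody",
]

def count_matrix(documents):
    """Return (sorted vocabulary, one row of token counts per document)."""
    counts = [
        Counter(t for t in doc.lower().split() if t not in STOP_WORDS)
        for doc in documents
    ]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

vocabulary, rows = count_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

The printed vocabulary and rows match the columns and values of the DataFrame shown above.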

In [26]:
vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words='english')
vectorizer.fit(new_df['text'].values)
vectorized = vectorizer.transform(new_df['text'].values)
# get_feature_names() was removed in scikit-learn 1.2; on versions < 1.0 keep the old name
listofwords = vectorizer.get_feature_names_out()
doubleword = pd.DataFrame(vectorized.toarray(), columns=listofwords)
doubleword.head()

Unnamed: 0,32 years,age 32,alvarez everybody,cdmx 32,hello everybody,jose alvarez,living cdmx,years everybody,years living
0,0,0,0,0,1,0,0,0,0
1,0,0,1,0,0,1,0,0,0
2,2,1,0,1,0,0,1,1,1


In this case, both settings produced the same number of unique elements (nine). However, we will see later that this is not always the case.

## TfidfTransformer

**TfidfTransformer** transforms a count matrix into a normalized tf or tf-idf representation.

Tf means term-frequency, while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval that has also found good use in document classification.

The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.
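\begin{equation}
\text{tf-idf}_{t,d} = (1 + \log \text{tf}_{t,d}) \cdot \log \frac{N}{\text{df}_t}
\end{equation}
Here $\text{tf}_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\text{df}_t$ is the number of documents that contain $t$. This is one common textbook form; with its default settings, **TfidfTransformer** uses a smoothed variant followed by row normalization, so the values below do not follow this equation exactly.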

In [29]:
idf_transformer = TfidfTransformer()
idf_transformer.fit(vectorized)
transformed = idf_transformer.transform(vectorized)

In [30]:
transformedDF = pd.DataFrame(transformed.toarray(), columns=listofwords)
transformedDF.head()

Unnamed: 0,32,age,alvarez,cdmx,everybody,hello,jose,living,years
0,0.0,0.0,0.0,0.0,0.508542,0.861037,0.0,0.0,0.0
1,0.0,0.0,0.652491,0.0,0.385372,0.0,0.652491,0.0,0.0
2,0.593683,0.296841,0.0,0.296841,0.175319,0.0,0.0,0.296841,0.593683
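
The values in this table can be checked by hand. With its default settings, `TfidfTransformer` uses the raw count as tf, a smoothed idf of $\ln\frac{1+N}{1+\text{df}_t} + 1$, and then scales each row to unit Euclidean length. The sketch below applies those three steps to the unigram count matrix shown earlier, using only the standard library; `tfidf_rows` is a hypothetical helper, not a scikit-learn function.

```python
import math

# Count matrix from the unigram example (columns in vocabulary order).
vocabulary = ["32", "age", "alvarez", "cdmx", "everybody", "hello", "jose", "living", "years"]
counts = [
    [0, 0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 1, 0, 0],
    [2, 1, 0, 1, 1, 0, 0, 1, 2],
]

def tfidf_rows(counts):
    n_docs = len(counts)
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]
    # Smoothed idf, as if one extra document contained every term.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # L2 normalization: divide by the Euclidean length of the row.
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm for v in weighted])
    return result

tfidf = tfidf_rows(counts)
for row in tfidf:
    print([round(v, 6) for v in row])
```

For example, "everybody" appears in all three documents, so its idf is 1 and it receives the lowest non-zero weight in every row, while "hello" appears in only one document and is weighted more heavily.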


# Pre-Processing our Data

Our data frame has 50,000 rows and 2 columns: a comment from the "IMDB" website and its classification as **1 (Positive)** or **0 (Negative)**.

|       | text                                              | label |
|-------|---------------------------------------------------|-------|
| 0     | Easily the worst movie I have ever seen in my ... | 0     |
| 1     | Ambushed is no ordinary action flick Its much ... | 0     |
| 2     | I loved this movie but then again I am a big C... | 1     |
| 3     | In 1933 Dick Powell and Ruby Keeler sang and d... | 1     |
| 4     | To make any film about the supposed end of the... | 0     |
| ...   | ...                                               | ...   |
| 49995 | While its true that the movie is somewhat inte... | 0     |
| 49996 | From the upper shelf of great Classic Books co... | 1     |
| 49997 | Good ideashame about the actual movie Would of... | 0     |
| 49998 | An unusual film for an audience outside the US... | 1     |
| 49999 | I really enjoyed The 60s Not being of that gen... | 1     |

Using the **CountVectorizer** with ngram_range of 1,1 will create an array of 50,000 rows by over 180,000 columns, while using a 2,2 ngram_range will create an array of 50,000 rows by over 3,120,000 columns.
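
To get a feel for these sizes, a rough back-of-the-envelope calculation shows why matrices this large are best kept in the sparse form that the vectorizer returns, rather than converted with `toarray()` as in the small example. The sketch assumes 8-byte cells and uses the lower bounds quoted above as column counts; `dense_gigabytes` is a hypothetical helper.

```python
def dense_gigabytes(rows, cols, bytes_per_cell=8):
    """Memory needed to store a rows x cols matrix densely, in GB (10**9 bytes)."""
    return rows * cols * bytes_per_cell / 10**9

print(dense_gigabytes(50_000, 180_000))    # 72.0
print(dense_gigabytes(50_000, 3_120_000))  # 1248.0
```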

Both arrays were then transformed with ***TfidfTransformer***.

# Model

For our model we selected ***SGDClassifier***, which fits linear classifiers such as an SVM or logistic regression, among others, using stochastic gradient descent. We went with the SVM, which is the default of ***SGDClassifier*** (`loss='hinge'`).

```python
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()  # default loss='hinge', i.e. a linear SVM
```
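
To illustrate what "a linear SVM trained with stochastic gradient descent" means, here is a deliberately simplified pure-Python sketch. It is not how `SGDClassifier` is implemented: it assumes a fixed learning rate, no shuffling, and labels coded as -1/+1, and the names `train_linear_svm_sgd` and `predict` as well as the toy data are made up for this example.

```python
def train_linear_svm_sgd(X, y, lr=0.1, alpha=0.01, epochs=50):
    """Fit weights w and bias b by SGD on the L2-regularized hinge loss.

    Labels must be -1 or +1 (a 0/1 label column would be mapped first).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The L2 penalty shrinks the weights a little on every step.
            w = [wj * (1 - lr * alpha) for wj in w]
            if margin < 1:
                # Hinge loss is active: nudge the boundary toward
                # classifying xi correctly with a margin of at least 1.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data (made up for this example).
X = [[2, 0], [3, 1], [0, 2], [1, 3]]
y = [1, 1, -1, -1]
w, b = train_linear_svm_sgd(X, y)
print([predict(w, b, x) for x in X])  # [1, 1, -1, -1]
```

Each update looks at a single sample, which is what lets this family of models scale to matrices with tens of thousands of rows and hundreds of thousands of columns.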

We also split our data set with `train_test_split()` (by default, 25% of the rows are held out for testing) and fed the training part to our model.

```python
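from sklearn.model_selection import train_test_split

# y is assumed to hold the labels (the `label` column shown above)
X_train, X_test, y_train, y_test = train_test_split(transformed, y)

model.fit(X_train, y_train)  # `model` is the SGDClassifier created above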
```

We trained the model with unigrams (`ngram_range=(1, 1)`) and with bigrams (`ngram_range=(2, 2)`), and our scores were as follows:

* unigram score train = 93.17
* unigram score test = 80.10

* bigram score train = 90.38
* bigram score test = 73.32

Thus, we used `ngram_range=(1, 1)`, since the unigram model scored higher on the test set (80.10 vs. 73.32).
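
The comparison behind this choice can be written out as quick arithmetic on the scores reported above. The dictionary layout below is just one convenient way to hold the four numbers; nothing here retrains a model.

```python
# Scores copied from the list above.
scores = {
    "unigram": {"train": 93.17, "test": 80.10},
    "bigram": {"train": 90.38, "test": 73.32},
}

for name, s in scores.items():
    gap = round(s["train"] - s["test"], 2)
    print(f"{name}: test={s['test']:.2f}, train-test gap={gap:.2f}")
# unigram: test=80.10, train-test gap=13.07
# bigram: test=73.32, train-test gap=17.06

# Pick the setting with the highest test score.
best = max(scores, key=lambda name: scores[name]["test"])
print(best)  # unigram
```

The unigram model has both the higher test score and the smaller gap between train and test, which suggests it generalizes better on this data.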