### **Natural language processing basic model**

In [1]:
#Basic Python Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re

### Natural Language Processing Library

1. nltk - the Natural Language Toolkit, a collection of libraries for natural language processing
2. stopwords - a list of very common words (such as "the" or "to") that add little meaning to a sentence
3. WordNetLemmatizer - reduces the different inflected forms of a word to a single base form (lemma) while keeping its meaning intact





In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

### Machine Learning Library

1. CountVectorizer - transforms text into vectors of token counts
2. GridSearchCV and RandomizedSearchCV - for hyperparameter tuning
3. RandomForestClassifier - machine learning algorithm for classification

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

### Evaluation Metrics
1. Accuracy Score - number of correctly classified instances / total number of instances
2. Precision Score
3. Recall Score
4. ROC Curve
5. Classification Report
6. Confusion Matrix
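
To make the first three metrics concrete, here is a minimal pure-Python sketch that derives them from the four confusion-matrix counts. The function name `binary_metrics` and the toy labels are illustrative assumptions and not part of the notebook, which takes these values from `sklearn.metrics`.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up labels, only for illustration
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Here 4 of the 6 predictions are correct, and both precision and recall are 2/3 because there is one false positive and one false negative.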

In [None]:
from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix,roc_curve,classification_report
#from scikitplot.metrics import plot_confusion_matrix

In [11]:
from google.colab import files
uploaded = files.upload()


Saving test.txt to test.txt
Saving train.txt to train.txt
Saving val.txt to val.txt


Now, we will read the training data and validation data.

In [12]:
df_train = pd.read_csv("train.txt",delimiter=';',names=['text','label'])
df_val = pd.read_csv("val.txt",delimiter=';',names=['text','label'])

Now, we will concatenate these two data frames

In [13]:
df = pd.concat([df_train,df_val])

In [14]:
print(df.shape)

(18000, 2)


In [17]:
df.reset_index(inplace=True,drop=True)

In [18]:
df.head()

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |


In [32]:
sns.countplot(x=df.label)  #recent seaborn versions expect the column as a keyword argument

array([0, 1])

#### As we can see, the dataset has 6 labels or targets. We will merge these labels into two classes: Positive and Negative sentiment.

1. Positive Sentiment - "joy","love","surprise"
2. Negative Sentiment - "anger","sadness","fear"

We will create a custom encoder to convert the categorical target labels to the numerical values 0 and 1, where
0 stands for negative sentiment, and
1 stands for positive sentiment
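
The merge described above amounts to a lookup table. The following is a minimal sketch using a plain dictionary; the names `SENTIMENT_MAP` and `encode_labels` are hypothetical and chosen only for this illustration.

```python
# Hypothetical lookup table: emotion label -> binary sentiment (1 = positive, 0 = negative)
SENTIMENT_MAP = {
    "joy": 1, "love": 1, "surprise": 1,
    "anger": 0, "sadness": 0, "fear": 0,
}

def encode_labels(labels):
    """Map each emotion label to 0 or 1; a label outside the six known ones raises KeyError."""
    return [SENTIMENT_MAP[label] for label in labels]

encoded = encode_labels(["sadness", "anger", "love", "joy"])
print(encoded)  # [0, 0, 1, 1]
```

Failing loudly on an unknown label is a design choice of this sketch: it makes a misspelled or unexpected class visible instead of silently leaving it unencoded.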

In [20]:
def custom_encoder(labels):
    """Replace the six emotion labels in place: positive -> 1, negative -> 0."""
    mapping = {"surprise": 1, "love": 1, "joy": 1,
               "fear": 0, "anger": 0, "sadness": 0}
    labels.replace(to_replace=mapping, inplace=True)

In [21]:
custom_encoder(df['label'])

In [22]:
df.head()

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 0 |
| 3 | i am ever feeling nostalgic about the fireplac... | 1 |
| 4 | i am feeling grouchy | 0 |


Now, we will perform some pre-processing on the data before converting it into vectors and passing it to the machine learning model.

In [23]:
#object of WordNetLemmatizer
lm = WordNetLemmatizer()

**A function for pre-processing the data**

1. First, we iterate through each record and, using a regular expression, replace every character that is not a letter with a space.
2. Then we convert the string to lowercase, because "Good" and "good" would otherwise be treated as different words and produce two separate vector entries for the same word, which we don't want.
3. Then we remove stopwords from the data. Stopwords are commonly used words such as "the", "an", "to", etc., which do not add much value.
4. Then we lemmatize each word, i.e. change the different forms of a word into a single item called a lemma. A lemma is the base form of a word. For example, run, running and runs are all forms of the same lexeme, where run is the lemma. Hence, we convert all occurrences of the same lexeme to its lemma. (Called without a part-of-speech tag, as in the function below, WordNetLemmatizer treats every word as a noun, so verb forms such as "running" are left unchanged.)
5. Finally, we return a corpus of processed data.
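
The steps above can be tried on a single sentence without downloading any NLTK data. This is only a sketch under stated assumptions: `TOY_STOPWORDS` is a tiny made-up stopword set standing in for NLTK's English list, and `toy_lemmatize` is a placeholder that returns the word unchanged where the notebook calls `WordNetLemmatizer.lemmatize`.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list instead
TOY_STOPWORDS = {"the", "an", "a", "to", "is", "i", "am"}

def toy_lemmatize(word):
    """Placeholder for WordNetLemmatizer.lemmatize; returns the word unchanged."""
    return word

def preprocess(text):
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # step 1: replace non-letters with spaces
    tokens = letters_only.lower().split()           # step 2: lowercase, then split into words
    kept = [toy_lemmatize(t) for t in tokens if t not in TOY_STOPWORDS]  # steps 3 and 4
    return " ".join(kept)                           # step 5: rebuild the cleaned string

cleaned = preprocess("I am going to the Market, it's GOOD!!")
print(cleaned)  # going market it s good
```

Note that the regular expression also splits contractions: "it's" becomes the two tokens "it" and "s", which is why a stray "s" survives in the output.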



In [5]:
def text_transformation(df_col):
    #build the stopword set once instead of once per word
    stop_words = set(stopwords.words('english'))
    corpus = []
    for item in df_col:
        #keep letters only, lowercase, and split into words
        new_item = re.sub('[^a-zA-Z]',' ',str(item))
        new_item = new_item.lower()
        new_item = new_item.split()
        #drop stopwords and lemmatize the remaining words
        new_item = [lm.lemmatize(word) for word in new_item if word not in stop_words]
        corpus.append(' '.join(new_item))
    return corpus

In [24]:
corpus = text_transformation(df['text'])

- Now, we will use the Bag of Words (BOW) model, which represents each text as a bag (multiset) of its words.
- The grammar and the order of words in a sentence are ignored; what matters is multiplicity, i.e. the number of times each word occurs in a document.
- Basically, it describes how often each word occurs within a document.


- sklearn provides a neat way of applying the bag of words technique through CountVectorizer
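
To see what a bag of words looks like, here is a minimal sketch built with `collections.Counter` from the standard library. It is not how CountVectorizer is implemented; the two toy documents and the variable names are assumptions made for this illustration.

```python
from collections import Counter

# Two made-up, already pre-processed documents
docs = ["feel happy happy today", "feel sad today"]

# Vocabulary: every distinct word across the documents, in sorted order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, aligned with the vocabulary
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['feel', 'happy', 'sad', 'today']
print(vectors)     # [[1, 2, 0, 1], [1, 0, 1, 1]]
```

Each row has one entry per vocabulary word, so word order is lost but the repeated "happy" in the first document is still visible as a count of 2.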


In [7]:
from sklearn.feature_extraction.text import CountVectorizer

Now, we will convert the text data into vectors by fitting and transforming the corpus.
We will set ngram_range to (1,2), which means that both unigrams (single words) and bigrams (pairs of consecutive words) are used as features.
An n-gram is a sequence of 'n' consecutive words in a sentence. 'ngram_range' is the parameter we use to give importance to combinations of words as well; for example, "social media" has a different meaning than the separate words "social" and "media".
We can experiment with the value of this parameter and select the option that gives better results.
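
As an illustration of which features a range of (1,2) produces, the following sketch generates word n-grams by hand. The helper `word_ngrams` is hypothetical; it treats the two numbers as the smallest and largest n, which is how the `ngram_range` parameter of CountVectorizer is documented.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams for n from ngram_range[0] to ngram_range[1], inclusive."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # slide a window of n words over the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

features = word_ngrams("social media is fun".split())
print(features)
# ['social', 'media', 'is', 'fun', 'social media', 'media is', 'is fun']
```

The single words are still present, and "social media" is added as a feature of its own, so the model can weight the phrase differently from its parts.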

In [25]:
cv = CountVectorizer(ngram_range=(1,2))
traindata = cv.fit_transform(corpus)

In [26]:
X = traindata
y = df.label

In this project, I'm going to use a Random Forest classifier, and we will tune its hyperparameters with a cross-validated search.
We will create a dictionary, "parameters", containing the candidate values of the hyperparameters. GridSearchCV trains a model for every possible combination of these values, whereas RandomizedSearchCV, which the search cell further down uses, trains only a random sample of the combinations.
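
The size of such a search space is easy to check with the standard library. The sketch below enumerates every combination of a small dictionary and then draws a random subset; the dictionary values and the fixed seed are illustrative assumptions, and scikit-learn's own sampling may differ in detail.

```python
import random
from itertools import product

# Same shape as a hyperparameter dictionary; the values are illustrative
param_grid = {
    "max_features": ["sqrt", "log2"],
    "n_estimators": [500, 1000],
    "max_depth": [10, None],
    "min_samples_split": [5],
}

# Exhaustive grid: the Cartesian product of all value lists
names = list(param_grid)
all_combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]
print(len(all_combinations))  # 8

# Randomized search idea: evaluate only a random subset of the grid
rng = random.Random(0)  # fixed seed so the sample is repeatable
sampled = rng.sample(all_combinations, 3)
for combination in sampled:
    print(combination)
```

With 5-fold cross-validation the exhaustive grid would need 8 x 5 = 40 model fits, while a sample of 3 combinations needs only 15, at the cost of possibly missing the best combination.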


In [27]:
parameters = {'max_features': ('sqrt','log2'),  #'auto' is no longer accepted by recent scikit-learn versions
             'n_estimators': [500, 1000],
             'max_depth': [10, None],
             'min_samples_split': [5],
             'min_samples_leaf': [1],
             'bootstrap': [True]}

Now, our search object (GridSearchCV() or RandomizedSearchCV()) will take the following parameters,

  1. Estimator or model - RandomForestClassifier in our case
  2. parameters - dictionary of hyperparameter names and their values
  3. cv - number of cross-validation folds
  4. return_train_score - whether the training scores of the various models are also returned
  5. n_jobs - number of jobs to run in parallel ("-1" means that all CPU cores are used, which reduces the training time drastically)
  6. n_iter (RandomizedSearchCV only) - number of parameter combinations that are sampled
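
To illustrate what the number of cross-validation folds means, here is a minimal sketch that splits sample indices into contiguous folds. The generator `k_fold_indices` is hypothetical and deliberately simplified: it does not shuffle, and it ignores the class balance that scikit-learn preserves by default when the estimator is a classifier.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    # the first (n_samples % n_folds) folds get one extra sample
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, n_folds=5))
print(folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```

Every sample is used for validation exactly once, and each candidate model is scored by the mean over the five validation folds.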



In [None]:
grid_search = RandomizedSearchCV(RandomForestClassifier(),parameters,cv=5,return_train_score=True,n_jobs=-1,n_iter=3)
grid_search.fit(X,y)
grid_search.best_params_

#### Displays models and their respective parameters, mean test score and rank. 

In [None]:
for i in range(len(grid_search.cv_results_['params'])):  #one entry per evaluated parameter combination
    print('Parameters: ',grid_search.cv_results_['params'][i])
    print('Mean Test Score: ',grid_search.cv_results_['mean_test_score'][i])
    print('Rank: ',grid_search.cv_results_['rank_test_score'][i])

#### Now, we will choose the best parameters obtained from the search and create a final random forest classifier model

In [None]:
#unpack the best parameter dictionary directly into the constructor
rfc = RandomForestClassifier(**grid_search.best_params_)

#### Now, we will train our new model.

In [None]:
rfc.fit(X,y)

## Test Data Transformation

In [None]:
test_df = pd.read_csv('test.txt',delimiter=';',names=['text','label'])

In [None]:
X_test,y_test = test_df.text,test_df.label
#encode the labels into two classes , 0 and 1
custom_encoder(y_test)  #encodes y_test in place and returns None, so the result is not assigned
#pre-processing of text
test_corpus = text_transformation(X_test)
#convert text data into vectors
testdata = cv.transform(test_corpus)
#predict the target
predictions = rfc.predict(testdata)

### Model evaluation

In [None]:
acc_score = accuracy_score(y_test,predictions)
pre_score = precision_score(y_test,predictions)
rec_score = recall_score(y_test,predictions)
print('Accuracy_score: ',acc_score)
print('Precision_score: ',pre_score)
print('Recall_score: ',rec_score)
print('-------------------------------------------------------------------')
cr = classification_report(y_test,predictions)
print(cr)

In [None]:
predictions_probability = rfc.predict_proba(testdata)

In [None]:


fpr,tpr,thresholds = roc_curve(y_test,predictions_probability[:,1])
plt.plot(fpr,tpr)
plt.plot([0,1],[0,1],linestyle='--')  #diagonal = performance of a random classifier
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
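
How the points of such a curve arise from predicted probabilities can be sketched in a few lines of plain Python. The function `roc_points` and the toy scores are assumptions for illustration only; `sklearn.metrics.roc_curve` may drop redundant thresholds and can therefore return fewer points.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists, one point per distinct score used as a threshold.

    Assumes y_true contains at least one 0 and at least one 1.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    fpr, tpr = [0.0], [0.0]  # the curve starts at the origin
    for threshold in sorted(set(scores), reverse=True):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr

# Made-up labels and positive-class probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
print(fpr)  # [0.0, 0.0, 0.5, 0.5, 1.0]
print(tpr)  # [0.0, 0.5, 0.5, 1.0, 1.0]
```

Lowering the threshold step by step labels more samples as positive, so both rates can only grow until the curve reaches (1, 1).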



## Predict for Custom Input

In [None]:
def expression_check(prediction_input):
    if prediction_input == 0:
        print("Input statement has Negative Sentiment.")
    elif prediction_input == 1:
        print("Input statement has Positive Sentiment.")
    else:
        print("Invalid Statement.")

In [None]:
def sentiment_predictor(texts):
    #texts is a list of raw strings; only the first prediction is reported
    corpus = text_transformation(texts)
    transformed_input = cv.transform(corpus)
    prediction = rfc.predict(transformed_input)
    expression_check(prediction[0])

In [None]:
input1 = ["Sometimes I just want to punch someone in the face."]
input2 = ["I bought a new phone and it's so good."]
sentiment_predictor(input1)
sentiment_predictor(input2)