# Text Classification

_____

## Table of Contents
- [Importing libraries](#Importing-libraries)
- [Load data](#Load-data)
- [Data Cleaning and Preparation](#Data-Cleaning-and-Preparation)
- [Machine Learning: LinearSVC](#Machine-Learning:-LinearSVC)
- [Save model](#Save-model)

_____

## Importing libraries

In [1]:
import pandas as pd
import pickle

_____

## Load data
Source: https://www.kaggle.com/ashishpatel26/sentimental-analysis-nlp

In [2]:
df = pd.read_csv('../data/raw/sentiment_data.csv', header=None, names=['Label', 'Text'], sep='\t')

### Preview a sample and check the dimensions

In [3]:
df.sample(10)

|      | Label | Text |
|------|-------|------|
| 4340 | 0 | The Da Vinci Code sucked big time. |
| 1180 | 1 | Loved the Mission Impossible quip and the fact... |
| 4461 | 0 | Da Vinci Code = Up, Up, Down, Down, Left, Righ... |
| 1454 | 1 | I liked the first " Mission Impossible. |
| 5946 | 0 | I hate Harry Potter, that daniel wotshisface n... |
| 3777 | 1 | Brokeback Mountain was an AWESOME movie. |
| 577  | 1 | The Da Vinci Code is awesome.. |
| 4199 | 0 | Da Vinci Code sucked.. |
| 3167 | 1 | Brokeback Mountain was an AWESOME movie. |
| 1766 | 1 | i love kirsten / leah / kate escapades and mis... |


In [4]:
df.shape

(6918, 2)

_____

## Data Cleaning and Preparation

### Check column names

In [5]:
df.columns

Index(['Label', 'Text'], dtype='object')

### Check for Nulls

In [6]:
df.isnull().values.any()

False

_____

## Machine Learning: LinearSVC

In [7]:
import sklearn
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

### Define Output and Inputs

In [8]:
y = df['Label']
X = df['Text']

### Split dataset

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [10]:
X_train.shape, X_test.shape

((5534,), (1384,))

In [11]:
y_train.shape, y_test.shape

((5534,), (1384,))
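
The split above is reshuffled on every run, so the shapes stay the same but the selected rows (and the final accuracy) can vary. A minimal sketch of a reproducible, stratified split follows. The toy DataFrame `toy_df` and the seed `42` are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataset (assumption).
toy_df = pd.DataFrame({
    'Label': [0, 1] * 10,
    'Text': [f'sample text {i}' for i in range(20)],
})

# random_state fixes the shuffle; stratify keeps the label ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df['Text'], toy_df['Label'],
    test_size=0.2, random_state=42, stratify=toy_df['Label'],
)

print(X_tr.shape, X_te.shape)
```

With a fixed seed, rerunning the notebook should yield the same rows in each part, which makes scores comparable between runs.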

### Vectorizer

In [12]:
tfidf_vect = TfidfVectorizer(max_features=15)

In [13]:
X_trans = tfidf_vect.fit_transform(X_train)

### Instantiate LinearSVC

In [14]:
classifier = LinearSVC(C=1.0, max_iter=1000, tol=1e-3)

In [15]:
linear_svc_model = classifier.fit(X_trans, y_train)



In [16]:
X_test_trans = tfidf_vect.transform(X_test)  # reuse the vocabulary and IDF weights fitted on X_train

In [17]:
y_pred = linear_svc_model.predict(X_test_trans)
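
The classifier expects test features in the same columns it saw during training, so the vectorizer should learn its vocabulary from the training texts only and then map the test texts onto it. A minimal sketch of that pattern, using hypothetical toy texts (`train_texts` and `test_texts` are assumptions, not rows from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts (assumption, not from the dataset).
train_texts = ['the movie was awesome', 'the movie sucked', 'i love this movie']
test_texts = ['this film was awesome', 'i hate it']

vect = TfidfVectorizer()
train_matrix = vect.fit_transform(train_texts)  # learns vocabulary and IDF weights
test_matrix = vect.transform(test_texts)        # reuses them; unseen words are ignored

# Both matrices share the same columns, so a model trained on one can score the other.
print(train_matrix.shape[1] == test_matrix.shape[1])
```

Calling `fit_transform` on the test texts instead would rebuild the vocabulary from the test set, so column indices would no longer refer to the same words the model was trained on.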

### Score

In [18]:
pred_results = pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})

In [19]:
accuracy = accuracy_score(y_test, y_pred)

In [20]:
accuracy

0.8865606936416185
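
Accuracy is a single number and does not show which class the errors fall in. As a hedged illustration, the sketch below computes accuracy together with a confusion matrix on hypothetical labels (`true_labels` and `pred_labels` are made up for this example and are not results from the notebook).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions (assumption), 1 = positive, 0 = negative.
true_labels = [1, 1, 1, 0, 0, 0, 1, 0]
pred_labels = [1, 1, 0, 0, 0, 1, 1, 0]

acc = accuracy_score(true_labels, pred_labels)
# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(true_labels, pred_labels, labels=[0, 1])

print(acc)
print(cm)
```

The same two calls could be applied to `y_test` and `y_pred` from the cells above.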

### Pipeline

In [21]:
from sklearn.pipeline import Pipeline

In [22]:
clf_pipeline = Pipeline(steps=[('tfidf_vect', tfidf_vect), ('classifier', classifier)])

In [23]:
pipeline_model = clf_pipeline.fit(X_train, y_train)



In [24]:
y_pred = pipeline_model.predict(X_test)  # the pipeline vectorizes raw text itself

In [25]:
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.8865606936416185

_____

## Save model

In [26]:
with open('../src/model/model.pkl', 'wb') as file:  
    pickle.dump(pipeline_model, file) 
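
A saved pipeline can be read back with `pickle.load` and used for prediction without refitting. The sketch below shows the round trip on a hypothetical toy pipeline written to a temporary directory; the training texts, the name `toy_pipeline`, and the temporary path are assumptions for illustration, while the notebook itself writes to `'../src/model/model.pkl'`.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data (assumption), 1 = positive, 0 = negative.
texts = ['awesome movie', 'i love it', 'great film', 'it sucked', 'i hate it', 'bad film']
labels = [1, 1, 1, 0, 0, 0]

toy_pipeline = Pipeline(steps=[('tfidf_vect', TfidfVectorizer()), ('classifier', LinearSVC())])
toy_pipeline.fit(texts, labels)

# Temporary path used only for this sketch.
model_path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
with open(model_path, 'wb') as file:
    pickle.dump(toy_pipeline, file)

with open(model_path, 'rb') as file:
    loaded_model = pickle.load(file)

# The reloaded pipeline should reproduce the original predictions.
same = (loaded_model.predict(texts) == toy_pipeline.predict(texts)).all()
print(same)
```

Pickle files can execute arbitrary code when loaded, so only unpickle models from a trusted source, ideally with the same scikit-learn version that produced them.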

In [33]:
text = 'bad'
prediction = pipeline_model.predict([text])
if prediction[0] == 1:  # predict returns an array, so compare its single element
    print('😀 Positive')
else:
    print('😓 Negative')

😓 Negative
