# Text Classification

_____

## Table of Contents
- [Importing libraries](#Importing-libraries)
- [Load data](#Load-data)
- [Data Cleaning and Preparation](#Data-Cleaning-and-Preparation)
- [Machine Learning: LinearSVC](#Machine-Learning:-LinearSVC)

_____

## Importing libraries

In [1]:
import pandas as pd

_____

## Load data
source: https://www.kaggle.com/ashishpatel26/sentimental-analysis-nlp

In [2]:
df = pd.read_csv('../data/sentiment_data.csv', header=None, names=['Label', 'Text'], sep='\t')

### Preview the data and check the dimensions

In [3]:
df.sample(10)

|  | Label | Text |
|---|---|---|
| 828 | 1 | The Da Vinci Code is awesome!! |
| 1787 | 1 | mission impossible 2 rocks!!.... |
| 3311 | 1 | Anyway, thats why I love " Brokeback Mountain. |
| 6335 | 0 | As I sit here, watching the MTV Movie Awards, ... |
| 1426 | 1 | i love kirsten / leah / kate escapades and mis... |
| 3787 | 1 | Brokeback Mountain was an AWESOME movie. |
| 29 | 1 | The Da Vinci Code's backtory on various religi... |
| 229 | 1 | Love luv lubb the Da Vinci Code! |
| 5915 | 0 | These Harry Potter movies really suck. |
| 1637 | 1 | mission impossible 2 rocks!!.... |


In [4]:
df.shape

(6918, 2)

_____

## Data Cleaning and Preparation

### Check column names

In [5]:
df.columns

Index(['Label', 'Text'], dtype='object')

### Check for Nulls

In [6]:
df.isnull().values.any()

False
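
Besides nulls, it can be worth checking the label balance and exact duplicate rows (the sample above shows the same text at rows 1787 and 1637). The cell below is a minimal sketch on a hypothetical `toy_df` that stands in for `df`; the variable names `label_share` and `n_duplicates` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for df; the real data comes from sentiment_data.csv
toy_df = pd.DataFrame({
    'Label': [1, 1, 0, 1, 0, 1],
    'Text': [
        'mission impossible 2 rocks!!....',
        'mission impossible 2 rocks!!....',
        'These Harry Potter movies really suck.',
        'The Da Vinci Code is awesome!!',
        'These Harry Potter movies really suck.',
        'Brokeback Mountain was an AWESOME movie.',
    ],
})

# Share of each label; a heavily skewed split makes plain accuracy harder to interpret
label_share = toy_df['Label'].value_counts(normalize=True)

# Rows that repeat an earlier (Label, Text) pair exactly
n_duplicates = int(toy_df.duplicated(subset=['Label', 'Text']).sum())

print(label_share)
print(n_duplicates)
```

Running the same two checks on `df` would show whether duplicates are common enough to leak between the train and test sets.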

_____

## Machine Learning: LinearSVC

In [7]:
import sklearn
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

### Define Output and Inputs

In [8]:
y = df['Label']
X = df['Text']

### Split dataset

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [10]:
X_train.shape, X_test.shape

((5534,), (1384,))

In [11]:
y_train.shape, y_test.shape

((5534,), (1384,))
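
The split above is not seeded, so the exact rows in each set (and the accuracy reported later) can change between runs. A possible variant, sketched on hypothetical toy data, fixes `random_state` and stratifies on the label; the seed value 42 and the `toy` frame are assumptions, not settings used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data standing in for df
toy = pd.DataFrame({
    'Label': [1, 0] * 10,
    'Text': ['I love this movie', 'this movie sucks'] * 10,
})

# random_state makes the split repeatable; stratify keeps the label ratio in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy['Text'], toy['Label'],
    test_size=0.2,
    random_state=42,   # arbitrary seed, chosen only for illustration
    stratify=toy['Label'],
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts())
```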

### Vectorizer

In [12]:
tfidf_vect = TfidfVectorizer(max_features=15)

In [13]:
X_trans = tfidf_vect.fit_transform(X_train)

### Instantiate LinearSVC

In [14]:
classifier = LinearSVC(C=1.0, max_iter=1000, tol=1e-3)

In [15]:
linear_svc_model = classifier.fit(X_trans, y_train)

In [16]:
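# Reuse the vocabulary and IDF weights learned from X_train;
# refitting on X_test would remap the feature columns the classifier was trained on
X_test_trans = tfidf_vect.transform(X_test)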

In [17]:
y_pred = linear_svc_model.predict(X_test_trans)
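
The classifier's weights are tied to the column order the vectorizer learned from the training text, so test data should be encoded with `transform` rather than a fresh fit. The sketch below illustrates the difference on hypothetical sentences: the same word ends up in a different column when a vectorizer is refit on other text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts, for illustration only
train_texts = ['awesome movie', 'awesome film', 'boring movie']
test_texts = ['boring film', 'boring plot']

vect = TfidfVectorizer()
vect.fit(train_texts)
train_vocab = dict(vect.vocabulary_)   # term -> column index learned from train_texts

# Correct: test text is mapped onto the training columns
X_ok = vect.transform(test_texts)

# Refitting on the test text builds a different term -> column mapping
refit = TfidfVectorizer()
refit.fit(test_texts)
test_vocab = dict(refit.vocabulary_)

print(train_vocab)
print(test_vocab)
print(X_ok.shape)
```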

### Score

In [18]:
pred_results = pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})

In [19]:
accuracy = accuracy_score(y_test, y_pred)

In [20]:
accuracy

0.9039017341040463
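
Accuracy alone does not show which class the errors fall on. As a possible complement, a confusion matrix and per-class report can be computed from the same `y_test` and `y_pred`; the sketch below uses short hypothetical label lists (`toy_true`, `toy_pred`) so it runs on its own.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels standing in for y_test and y_pred
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# Rows are true labels, columns are predicted labels, both ordered [0, 1]
cm = confusion_matrix(toy_true, toy_pred)
toy_accuracy = accuracy_score(toy_true, toy_pred)

print(cm)
print(toy_accuracy)
print(classification_report(toy_true, toy_pred))
```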

### Pipeline

In [21]:
from sklearn.pipeline import Pipeline

In [22]:
clf_pipeline = Pipeline(steps=[('tfidf_vect', tfidf_vect), ('classifier', classifier)])

In [23]:
pipeline_model = clf_pipeline.fit(X_train, y_train)

In [24]:
y_pred = pipeline_model.predict(X_test)  # the pipeline vectorizes the raw text itself

In [25]:
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.9039017341040463
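
Because the pipeline bundles vectorizing and classification, it can be passed to `cross_val_score`, which refits the vectorizer inside each fold so no fold sees vocabulary from its own held-out text. This is a sketch on hypothetical toy data, not a result from the notebook; `cv=5` and the toy sentences are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for X and y
toy_texts = ['I love this movie', 'this movie sucks'] * 10
toy_labels = [1, 0] * 10

toy_pipeline = Pipeline(steps=[
    ('tfidf_vect', TfidfVectorizer()),
    ('classifier', LinearSVC(C=1.0, max_iter=1000, tol=1e-3)),
])

# One accuracy value per fold; the spread hints at how stable a single split's score is
cv_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels, cv=5, scoring='accuracy')

print(cv_scores.mean(), cv_scores.std())
```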

### Create a Checkpoint

In [26]:
pipe_clf_param = {}
pipe_clf_param['pipeline_clf'] = pipeline_model
pipe_clf_param['sklearn_version'] = sklearn.__version__
pipe_clf_param['accuracy'] = accuracy

pipe_clf_param

{'pipeline_clf': Pipeline(steps=[('tfidf_vect', TfidfVectorizer(max_features=15)),
                 ('classifier', LinearSVC(tol=0.001))]),
 'sklearn_version': '1.0.2',
 'accuracy': 0.9039017341040463}

### Save

In [27]:
import joblib

In [28]:
filename = '../models/pipe_clf_model_checkpoint.joblib'

In [29]:
joblib.dump(pipe_clf_param, filename)

['../models/pipe_clf_model_checkpoint.joblib']

_____

### Load

In [30]:
pipe_clf_checkpoint = joblib.load(filename)

In [31]:
reloaded_pipeline = pipe_clf_checkpoint['pipeline_clf']
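
Storing `sklearn_version` in the checkpoint makes it possible to compare it with the installed version at load time, since pickled estimators are not guaranteed to load correctly across scikit-learn versions. The sketch below round-trips a hypothetical checkpoint dictionary through a temporary file and performs that comparison; the path and the placeholder accuracy value are illustrative only.

```python
import os
import tempfile
import warnings

import joblib
import sklearn

# Hypothetical checkpoint with the same keys as pipe_clf_param (model omitted for brevity)
toy_checkpoint = {'sklearn_version': sklearn.__version__, 'accuracy': 0.9}

toy_path = os.path.join(tempfile.mkdtemp(), 'toy_checkpoint.joblib')
joblib.dump(toy_checkpoint, toy_path)

loaded = joblib.load(toy_path)

# Warn if the checkpoint was written by a different scikit-learn version
version_matches = loaded['sklearn_version'] == sklearn.__version__
if not version_matches:
    warnings.warn(
        f"Checkpoint saved with scikit-learn {loaded['sklearn_version']}, "
        f"running {sklearn.__version__}"
    )
```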

### Test

In [32]:
y_pred = reloaded_pipeline.predict(X_test)

In [33]:
accuracy_score(y_test, y_pred)

0.9039017341040463
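
A reloaded pipeline can also score text it has never seen, since it accepts raw strings. The sketch below trains a small hypothetical pipeline on toy sentences and predicts two new ones; with the real checkpoint, `reloaded_pipeline.predict(new_texts)` would play the same role.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data (1 = positive, 0 = negative)
toy_texts = ['I love this movie', 'this movie sucks'] * 10
toy_labels = [1, 0] * 10

toy_pipeline = Pipeline(steps=[
    ('tfidf_vect', TfidfVectorizer()),
    ('classifier', LinearSVC(C=1.0, max_iter=1000, tol=1e-3)),
])
toy_pipeline.fit(toy_texts, toy_labels)

# Raw strings go straight in; words outside the learned vocabulary are ignored
new_texts = ['love it', 'it sucks']
new_pred = toy_pipeline.predict(new_texts)

print(list(zip(new_texts, new_pred)))
```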