
**Data Exploration (Exploratory Data Analysis)**

In [None]:
import sklearn
import numpy as np
import pandas as pd

In [None]:
#training data
path = "/content/drive/MyDrive/Colab_CSV/Covid_Data.csv"
train = pd.read_csv(path)

#test data
path1 = "/content/drive/MyDrive/Colab_CSV/corona_dataset_test.csv"
test = pd.read_csv(path1)

In [None]:
train.head(10)

Unnamed: 0,Serial Number,Tweet ID,Tweet Content,Label
0,1,1.24e+18,And if you found something fishy just informed...,Informative
1,2,1.24e+18,Canada’s cyber spies taking down sites as...,Non-Actionable
2,3,1.37e+18,"@MehHarshil Lol.... Yes, it is true...and you ...",Non-Actionable
3,4,1.24e+18,India would have been saved if the politicia...,Non-Actionable
4,5,1.41e+18,Mumbai Covid vaccine scam: Probe says victims ...,Informative
5,6,1.41e+18,"Around 2,000 people were injected with fake CO...",Informative
6,7,1.41e+18,Mumbai Police has launched an investigation in...,Informative
7,8,1.41e+18,Mumbai Vaccine Scam: The police on investigati...,Informative
8,9,1.41e+18,Beware of Covid-19 vaccine scams #SocialMedia ...,Informative
9,10,1.41e+18,Mumbai vaccine scam: 4 arrested for organizing...,Informative


In [None]:
mapping_dataCol = {'Non-Actionable':0,'Informative':1,'Debatable':2,'Other Fraud':3,'Party Politics':4,
'Actionable':5,'Negative':6}
train["Label"]=train["Label"].map(mapping_dataCol)

# mapping_dataCol = {'N':0, 'A':1}
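
`Series.map` with a dictionary returns `NaN` for any value that has no key in the dictionary, so a misspelled or differently cased label would silently become a missing value. The null check in the next cell reports 0 for `Label`, which indicates that every label in this dataset was mapped. A minimal sketch of a check that could be run before mapping is shown here; `find_unmapped` and `sample_labels` are hypothetical names used for illustration, and the sample labels are not rows from the dataset.

```python
from collections import Counter

# Mapping copied from the notebook cell above.
mapping_dataCol = {'Non-Actionable': 0, 'Informative': 1, 'Debatable': 2,
                   'Other Fraud': 3, 'Party Politics': 4,
                   'Actionable': 5, 'Negative': 6}

def find_unmapped(labels, mapping):
    """Return a Counter of label strings that have no entry in `mapping`."""
    return Counter(label for label in labels if label not in mapping)

# Hypothetical raw labels; 'informative' (lower case) is a deliberate mismatch.
sample_labels = ['Informative', 'Non-Actionable', 'informative', 'Negative']
unmapped = find_unmapped(sample_labels, mapping_dataCol)
print(unmapped)
```

An empty `Counter` means every label has a numeric code; anything else lists the strings that would turn into `NaN`.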

In [None]:
# count the null or missing values in each column
train.isnull().sum()

Serial Number    0
Tweet ID         0
Tweet Content    0
Label            0
dtype: int64

In [None]:
# check, as a single Boolean, whether any value is null or missing
train.isnull().values.any()


False

**Data Cleaning**

In [None]:
#Install Tweet-preprocessor to clean the tweets

!pip install tweet-preprocessor

Collecting tweet-preprocessor
  Downloading tweet_preprocessor-0.6.0-py3-none-any.whl (27 kB)
Installing collected packages: tweet-preprocessor
Successfully installed tweet-preprocessor-0.6.0


In [None]:
# remove special characters using the regular expression library
import re

# punctuation to delete outright (raw strings avoid invalid-escape warnings)
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\|)|(\()|(\))|(\[)|(\])|(\%)|(\$)|(\>)|(\<)|(\{)|(\})")
# separators to replace with a single space
REPLACE_WITH_SPACE = re.compile(r"(<br\s/><br\s/?)|(-)|(/)|(:).")

In [None]:
import preprocessor as p

# function to clean the dataset (combining tweet-preprocessor and regular expressions)

def clean_tweets(df):
  tempArr = []
  for line in df:
    # pass the tweet through tweet-preprocessor's clean()
    tmpL = p.clean(line)
    # convert to lower case and remove punctuation
    tmpL = REPLACE_NO_SPACE.sub("", tmpL.lower())
    # replace separators such as '-' and '/' with a space
    tmpL = REPLACE_WITH_SPACE.sub(" ", tmpL)
    tempArr.append(tmpL)
  return tempArr
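
To see what the two regular-expression passes do on their own, here is a minimal sketch that leaves out the `p.clean` step. The patterns `STRIP_CHARS` and `SPACE_CHARS` are simplified stand-ins written as character classes (an assumption made for readability, not the notebook's exact expressions), and the input sentence is made up.

```python
import re

# Simplified stand-ins for the notebook's two patterns (assumption: character
# classes are used here for readability; they are not the original regexes).
STRIP_CHARS = re.compile(r"[.;:!'?,\"|()\[\]%$><{}]")
SPACE_CHARS = re.compile(r"[-/]")

def strip_punctuation(text):
    """Lower-case `text`, delete punctuation, and turn '-' and '/' into spaces."""
    text = STRIP_CHARS.sub("", text.lower())
    return SPACE_CHARS.sub(" ", text)

print(strip_punctuation("Beware of Covid-19 vaccine scams, stay safe!"))
# prints: beware of covid 19 vaccine scams stay safe
```

Because punctuation is deleted rather than replaced by a space, words joined only by punctuation run together, which is why `true...and` appears as `trueand` in the cleaned output further down.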

In [None]:
# clean training data
train_tweet = clean_tweets(train["Tweet Content"])
train_tweet = pd.DataFrame(train_tweet)

In [None]:
# append cleaned tweets to the training data
train["clean_tweet"] = train_tweet

# compare the cleaned and uncleaned tweets
train.head(10)

Unnamed: 0,Serial Number,Tweet ID,Tweet Content,Label,clean_tweet
0,1,1.24e+18,And if you found something fishy just informed...,1,and if you found something fishy just informed...
1,2,1.24e+18,Canada’s cyber spies taking down sites as...,0,canada’s cyber spies taking down sites as...
2,3,1.37e+18,"@MehHarshil Lol.... Yes, it is true...and you ...",0,lol yes it is trueand you have no idea about o...
3,4,1.24e+18,India would have been saved if the politicia...,0,india would have been saved if the politicians...
4,5,1.41e+18,Mumbai Covid vaccine scam: Probe says victims ...,1,mumbai covid vaccine scam probe says victims g...
5,6,1.41e+18,"Around 2,000 people were injected with fake CO...",1,around people were injected with fake covid 19...
6,7,1.41e+18,Mumbai Police has launched an investigation in...,1,mumbai police has launched an investigation in...
7,8,1.41e+18,Mumbai Vaccine Scam: The police on investigati...,1,mumbai vaccine scam the police on investigatio...
8,9,1.41e+18,Beware of Covid-19 vaccine scams #SocialMedia ...,1,beware of covid 19 vaccine scams via
9,10,1.41e+18,Mumbai vaccine scam: 4 arrested for organizing...,1,mumbai vaccine scam arrested for organizing fa...


In [None]:
# clean the test data
test_tweet = clean_tweets(test["Tweet Content"])
test_tweet = pd.DataFrame(test_tweet)
# append cleaned tweets to the test data
test["clean_tweet"] = test_tweet

# compare the cleaned and uncleaned tweets
test.tail()

Unnamed: 0,Serial Number,Tweet ID,Tweet Content,Label,clean_tweet
75,396,1408834228482023424,Mumbai: Jobless travel agents got roped in fra...,Other Fraud,mumbai jobless travel agents got roped in frau...
76,397,1407983436438990853,Dear Gujrat!\nDon’t ever get fooled by th...,Negative,dear gujrat\ndon’t ever get fooled by the...
77,398,1407946489041543177,The #BombayHighCourt is hearing an advocate's ...,Non-Actionable,the is hearing an advocates pil regarding prob...
78,399,1407761645070290946,Man Awaiting Trial for Covid-19 Bank Fraud Doe...,Non-Actionable,man awaiting trial for covid 19 bank fraud doe...
79,400,1407739158005571587,"@DelhiPolice A man named masood Hashim, a resi...",Actionable,a man named masood hashim a resident of janta ...


**Test and Train Split**

In [None]:
from sklearn.model_selection import train_test_split

# extract the labels from the train data
y = train.Label.values

# use 80% for training and 20% for testing (test_size=0.2)
x_train, x_test, y_train, y_test = train_test_split(train.clean_tweet.values, y, 
                                                    stratify=y, 
                                                    random_state=1, 
                                                    test_size=0.2, shuffle=True)
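
With `stratify=y`, each class should keep roughly the same share in both splits. The sketch below checks this using only the per-class counts printed elsewhere in this notebook: the "Before SMOTE" counter for the training split and the `support` column of the classification report for the held-out split. The variable names are illustrative.

```python
# Per-class counts reported elsewhere in this notebook:
# training split (the "Before SMOTE" Counter) and held-out split (report support).
train_counts = {0: 165, 1: 77, 2: 28, 3: 18, 4: 9, 5: 14, 6: 9}
test_counts = {0: 41, 1: 19, 2: 7, 3: 4, 4: 2, 5: 4, 6: 3}

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())
print(f"train={n_train}, test={n_test}, "
      f"test share={n_test / (n_train + n_test):.2f}")

# Compare the share of each class in the two splits.
for label in sorted(train_counts):
    train_share = train_counts[label] / n_train
    test_share = test_counts[label] / n_test
    print(f"class {label}: train {train_share:.3f}  test {test_share:.3f}")
```

The split is 320 to 80 rows, and the class shares agree to within about one percentage point. The smallest classes still have only 2 to 4 held-out examples, so their per-class scores are very coarse.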

**Vectorize tweets using CountVectorizer**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# vectorize tweets for model building
vectorizer = CountVectorizer(binary=False, stop_words='english')

# learn the vocabulary from the training and held-out tweets together
# (note: the held-out split therefore influences the feature space)
vectorizer.fit(list(x_train) + list(x_test))

# transform documents to document-term matrix
x_train_vec = vectorizer.transform(x_train)
x_test_vec = vectorizer.transform(x_test)
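
To make the document-term matrix concrete, here is a plain-Python sketch of the bag-of-words idea. It is not `CountVectorizer` itself: it splits on whitespace only and does no stop-word removal. The helper names and the three short documents are hypothetical. The sketch learns the vocabulary from the training documents alone, a common alternative to fitting on both splits, so tokens that appear only in held-out text are dropped.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct whitespace-separated token in `docs` to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_count_vector(doc, vocabulary):
    """Count vocabulary tokens in `doc`; tokens outside the vocabulary are ignored."""
    counts = Counter(tok for tok in doc.split() if tok in vocabulary)
    return [counts[tok] for tok in sorted(vocabulary, key=vocabulary.get)]

# Hypothetical cleaned tweets, not rows from the dataset.
train_docs = ["vaccine scam mumbai", "fake vaccine vaccine"]
test_docs = ["vaccine fraud"]

vocab = build_vocabulary(train_docs)  # learned from training text only
print(vocab)
print([to_count_vector(d, vocab) for d in train_docs])
print([to_count_vector(d, vocab) for d in test_docs])
```

Each row has one column per vocabulary token. The token `fraud` has no column because it never occurs in the training documents, so the held-out row only counts `vaccine`.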

**Model Building: Apply Support Vector Classifier (SVC)**

In [None]:
from sklearn.svm import SVC
# classify using a support vector classifier with a linear kernel
# (importing SVC directly avoids shadowing the sklearn.svm module with the variable `svm`)
svm = SVC(kernel='linear', probability=True)

# fit the SVC model on the training data and get class probabilities for x_test
prob = svm.fit(x_train_vec, y_train).predict_proba(x_test_vec)

# predict class labels for the samples in x_test
y_pred_svm = svm.predict(x_test_vec)

**Classification Report**

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
Y_pred = svm.predict(x_test_vec)
# y_test is already a 1-D array, so no reshape is needed
print(classification_report(y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.64      0.93      0.76        41
           1       0.60      0.47      0.53        19
           2       0.00      0.00      0.00         7
           3       0.00      0.00      0.00         4
           4       0.00      0.00      0.00         2
           5       0.50      0.50      0.50         4
           6       0.00      0.00      0.00         3

    accuracy                           0.61        80
   macro avg       0.25      0.27      0.26        80
weighted avg       0.50      0.61      0.54        80





**PRE SMOTE ACCURACY**

In [None]:
from sklearn.metrics import accuracy_score
print("Accuracy score for SVC is: ", accuracy_score(y_test, y_pred_svm) * 100, '%')

Accuracy score for SVC is:  61.25000000000001 %
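
The accuracy and the two average rows of the classification report can be reproduced from the per-class rows, which shows why the macro average is so much lower than the accuracy. The sketch below uses only the rounded figures printed in the report above, so the results agree to two decimals; the variable names are illustrative.

```python
# (precision, recall, support) per class, copied from the classification report.
report = {
    0: (0.64, 0.93, 41),
    1: (0.60, 0.47, 19),
    2: (0.00, 0.00, 7),
    3: (0.00, 0.00, 4),
    4: (0.00, 0.00, 2),
    5: (0.50, 0.50, 4),
    6: (0.00, 0.00, 3),
}

total = sum(sup for _, _, sup in report.values())

# Macro average: every class counts equally, regardless of its size.
macro_precision = sum(prec for prec, _, _ in report.values()) / len(report)
macro_recall = sum(rec for _, rec, _ in report.values()) / len(report)

# Weighted average: each class is weighted by its share of the held-out rows.
weighted_precision = sum(prec * sup for prec, _, sup in report.values()) / total
weighted_recall = sum(rec * sup for _, rec, sup in report.values()) / total

# Correct predictions per class, recovered by rounding recall * support.
correct = sum(round(rec * sup) for _, rec, sup in report.values())
accuracy = correct / total

print(f"macro precision {macro_precision:.2f}, macro recall {macro_recall:.2f}")
print(f"weighted precision {weighted_precision:.2f}, weighted recall {weighted_recall:.2f}")
print(f"accuracy {accuracy:.4f} ({correct}/{total})")
```

Four of the seven classes have zero precision and recall, so they pull the macro averages down to about 0.25, while the accuracy of 49/80 is carried almost entirely by classes 0 and 1.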


**POST SMOTE ACCURACY**

In [None]:
# the imblearn module is distributed on PyPI as imbalanced-learn
!pip install imbalanced-learn



In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE()

In [None]:
X_train_smote, Y_train_smote = smote.fit_resample(x_train_vec.astype('float'), y_train)
from collections import Counter

In [None]:
print("Before SMOTE: ", Counter(y_train))
print("After SMOTE: ", Counter(Y_train_smote))

Before SMOTE:  Counter({0: 165, 1: 77, 2: 28, 3: 18, 5: 14, 4: 9, 6: 9})
After SMOTE:  Counter({0: 165, 4: 165, 1: 165, 3: 165, 2: 165, 6: 165, 5: 165})
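
The counters above imply how much of the resampled training set is synthetic. The sketch below works that out from the "Before SMOTE" counts and also illustrates the interpolation idea behind SMOTE: a new point is placed somewhere on the line segment between a minority sample and one of its neighbours. The `interpolate` helper and the two example vectors are hypothetical; this is not imbalanced-learn's implementation.

```python
import random

# Class counts before oversampling, from the "Before SMOTE" output above.
before = {0: 165, 1: 77, 2: 28, 3: 18, 4: 9, 5: 14, 6: 9}
target = max(before.values())  # every class is raised to 165

# Number of synthetic rows that must be created per class.
synthetic = {label: target - n for label, n in before.items()}
print(synthetic, "total synthetic:", sum(synthetic.values()))

def interpolate(x, neighbour, rng):
    """Return a point on the segment between `x` and `neighbour`.

    This is the core idea behind SMOTE, not imbalanced-learn's implementation.
    """
    lam = rng.random()  # lam lies in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbour)]

rng = random.Random(0)  # fixed seed, for repeatability
new_point = interpolate([1.0, 0.0, 2.0], [3.0, 0.0, 0.0], rng)
print(new_point)
```

Of the 1,155 resampled rows, 835 are synthetic. Classes 4 and 6 go from 9 real tweets to 165 rows each, so 156 of their rows are interpolated, and interpolated count vectors generally contain fractional values rather than whole word counts.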


In [None]:
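# note: svm has not yet been refit on the SMOTE-resampled data at this point,
# so this report is identical to the pre-SMOTE report above
from sklearn.metrics import classification_report, confusion_matrix
Y_pred = svm.predict(x_test_vec)
print(classification_report(y_test, Y_pred))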

              precision    recall  f1-score   support

           0       0.64      0.93      0.76        41
           1       0.60      0.47      0.53        19
           2       0.00      0.00      0.00         7
           3       0.00      0.00      0.00         4
           4       0.00      0.00      0.00         2
           5       0.50      0.50      0.50         4
           6       0.00      0.00      0.00         3

    accuracy                           0.61        80
   macro avg       0.25      0.27      0.26        80
weighted avg       0.50      0.61      0.54        80





In [None]:
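# refit the classifier on the SMOTE-balanced training data,
# then score it on the original (not resampled) held-out split
svm.fit(X_train_smote, Y_train_smote)
Y_pred = svm.predict(x_test_vec)
print("Accuracy score for SVM is: ", accuracy_score(y_test, Y_pred) * 100, '%')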

Accuracy score for SVM is:  51.24999999999999 %
