<h1>3 Flair Detector </h1>
<br>
The Logistic Regression classification algorithm is easily generalized to multiple classes. 
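
As a rough illustration of that generalization, the sketch below turns one score per class into probabilities with the softmax function, which is one common multi-class formulation of logistic regression. The flair names and scores are made-up values, not output from the model trained in this notebook.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtract the largest score before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three flairs (illustrative only).
flair_names = ["Politics", "Sports", "Food"]
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)

# The predicted flair is the one with the highest probability.
predicted = flair_names[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

With two classes this reduces to the familiar sigmoid form of binary logistic regression.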

In [44]:
import pandas as pd
import os
india_data = pd.read_csv("reddit_india_data.csv")

In [45]:
flairs = india_data.flair.unique()

<p><b>3.1 Text Pre-Processing:</b> For our data set from Reddit, the text cleaning step includes:
    a) removing stop words,
    b) changing text to lower case,
    c) removing punctuation,
    d) removing bad characters.

The print_val function takes an index between 0 and 999 (the range follows from the size of the data frame found in 2.1), stores the values of that row in "val", and prints the title and the flair, the two fields most relevant to the classifier. We define this function so that we can inspect raw rows and recognize what kind of text pre-processing our data requires. </p>

In [46]:
def print_val(index):
    val = india_data[india_data.index == index].values[0]
    print('Title: ', val[1])
    print('Flair:', val[0])

<p><b>3.2 </b>Using the regular expression module, we define two patterns: the symbols that are replaced with a space and the symbols that are removed outright. Both substitutions are done with the sub() method of the compiled patterns from the re module. In stop_words, we store the stopword list taken from the Natural Language Toolkit (nltk). </p>

In [47]:
import re
import nltk
from nltk.corpus import stopwords

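# nltk.download('stopwords')  # needed once if the stopwords corpus is not installed locally
replace_by_space = re.compile(r'[/(){}\[\]\|@,;]')
bad_symbols = re.compile(r'[^0-9a-z #+_]')
stop_words = set(stopwords.words('english'))  # a set makes the membership test below fast

def clean_data(text):
    # convert to lowercase
    text = text.lower()
    # pattern.sub(replacement, text): replace separator symbols with a space
    text = replace_by_space.sub(' ', text)
    # drop every character that is not a digit, lowercase letter, space, #, + or _
    text = bad_symbols.sub('', text)
    # remove the stopwords
    text = ' '.join(word for word in text.split() if word not in stop_words)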
    return text
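
To make the effect of these steps concrete, here is a minimal, self-contained sketch that applies the same two patterns to a made-up title. It assumes a tiny hand-picked stop word set (demo_stop_words) in place of the nltk list, so it runs without downloading the corpus; clean_demo and the sample string are illustrative names only.

```python
import re

# Same patterns as in clean_data above.
replace_by_space = re.compile(r'[/(){}\[\]\|@,;]')
bad_symbols = re.compile(r'[^0-9a-z #+_]')

# Assumption: a small stand-in for stopwords.words('english').
demo_stop_words = {"the", "is", "in", "a", "of", "what"}

def clean_demo(text):
    text = text.lower()                     # lower case
    text = replace_by_space.sub(' ', text)  # separators become spaces
    text = bad_symbols.sub('', text)        # remaining punctuation is dropped
    return ' '.join(w for w in text.split() if w not in demo_stop_words)

sample = "What is the GDP/growth rate in India (2020)?"
cleaned = clean_demo(sample)
print(cleaned)  # gdp growth rate india 2020
```

Note that the slash is turned into a space, so "GDP/growth" survives as two tokens, while the question mark is simply removed.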

<p><b>3.3</b> The URL contains parts that could negatively impact the model's prediction, so it gets an extra cleaning step that removes the http:// or https:// scheme, a leading www. and a trailing slash from the string. </p>

In [48]:
def clean_url(u):
    # strip the scheme
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("https://"):
        u = u[8:]
    # strip a leading "www."
    if u.startswith("www."):
        u = u[4:]
    # strip a trailing slash
    if u.endswith("/"):
        u = u[:-1]
    return u

<p><b>3.4</b> Next we clean all the columns that will be taken into consideration for the classifier: title, comments, body and url. </p>

In [49]:
india_data['title'] = india_data['title'].apply(clean_data)
india_data['comments'] = india_data['comments'].astype('str').apply(clean_data)
india_data['body'] = india_data['body'].astype('str').apply(clean_data)

india_data['url'] = india_data['url'].apply(clean_url)
india_data['url'] = india_data['url'].apply(clean_data)
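
The two URL steps can be traced on a single string. The sketch below is a standalone illustration: it repeats clean_url, uses a stop-word-free variant of the text cleaning (clean_text_no_stopwords, a hypothetical helper) to avoid the nltk dependency, and runs on a made-up example.com address.

```python
import re

def clean_url(u):
    # Strip the scheme, a leading "www." and a trailing slash.
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("https://"):
        u = u[8:]
    if u.startswith("www."):
        u = u[4:]
    if u.endswith("/"):
        u = u[:-1]
    return u

replace_by_space = re.compile(r'[/(){}\[\]\|@,;]')
bad_symbols = re.compile(r'[^0-9a-z #+_]')

def clean_text_no_stopwords(text):
    # Assumption: stop word removal is left out to keep the sketch free of nltk.
    text = text.lower()
    text = replace_by_space.sub(' ', text)
    return bad_symbols.sub('', text)

raw = "https://www.example.com/news/India-Budget/"
step1 = clean_url(raw)
step2 = clean_text_no_stopwords(step1)
print(step1)  # example.com/news/India-Budget
print(step2)  # examplecom news indiabudget
```

Path segments become separate tokens, while dots and hyphens are dropped, so the domain ends up as a single token such as "examplecom".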

<h3> Logistic Regression Model (Title) </h3>

In [7]:
from sklearn.model_selection import train_test_split
X = india_data.title
y = india_data.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) #random state instance will be created by np.random
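
Because the split above is not seeded, the accuracy changes from run to run. If repeatable numbers are wanted, one option is to pass random_state (and optionally stratify) to train_test_split. The sketch below is only an illustration on toy data; the seed 42 and the toy titles are arbitrary assumptions, and this is not how the results in this notebook were produced.

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 titles with two balanced flairs (illustrative values only).
titles = [f"title {i}" for i in range(10)]
flairs_toy = ["Politics", "Sports"] * 5

# A fixed random_state makes the split repeatable; stratify keeps the
# flair proportions the same in the train and test parts.
split_a = train_test_split(titles, flairs_toy, test_size=0.2,
                           random_state=42, stratify=flairs_toy)
split_b = train_test_split(titles, flairs_toy, test_size=0.2,
                           random_state=42, stratify=flairs_toy)

X_train_toy, X_test_toy, y_train_toy, y_test_toy = split_a
print(X_test_toy, y_test_toy)
print(split_a == split_b)
```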

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report


logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.74
                  precision    recall  f1-score   support

   Non-Political       0.63      0.60      0.62        20
       Scheduled       0.60      0.56      0.58        27
     Photography       0.91      0.95      0.93        21
        Politics       0.95      0.95      0.95        19
Business/Finance       0.93      0.93      0.93        14
  Policy/Economy       0.78      0.90      0.84        20
          Sports       0.46      0.60      0.52        20
            Food       0.82      0.85      0.84        27
        AskIndia       0.69      0.55      0.61        20
     Coronavirus       0.75      0.50      0.60        12

        accuracy                           0.74       200
       macro avg       0.75      0.74      0.74       200
    weighted avg       0.74      0.74      0.74       200



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


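<p>The accuracy from Logistic Regression on the title alone is about 74% in the run shown above; since the train/test split is not seeded, the figure varies from run to run. </p>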
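
One detail worth checking when reading per-class reports like the one above: classification_report computes its rows for the labels in sorted order and attaches target_names to them by position. A list taken from unique() is in order of first appearance, so the row names may not line up with the classes. The sketch below shows the effect on toy labels and one possible way around it (passing labels explicitly); the toy data is made up for illustration.

```python
from sklearn.metrics import classification_report

y_true_toy = ["Sports", "Food", "Food", "Sports"]
y_pred_toy = ["Sports", "Food", "Sports", "Sports"]

# Order of first appearance, as unique() would return it.
appearance_order = ["Sports", "Food"]

# target_names only: names are attached to the sorted labels ["Food", "Sports"],
# so the row called "Sports" actually holds the metrics for Food.
mislabeled = classification_report(y_true_toy, y_pred_toy,
                                   target_names=appearance_order,
                                   output_dict=True)

# labels given explicitly: every row is tied to the class it names.
explicit = classification_report(y_true_toy, y_pred_toy,
                                 labels=appearance_order,
                                 output_dict=True)

print(mislabeled["Sports"]["recall"], explicit["Sports"]["recall"])
```

In this toy case the true recall for Sports is 1.0, but the report built with target_names alone shows 0.5 in the "Sports" row.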

<h3>Logistic Regression Model (Title + Comments) </h3>

In [51]:
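# Build the model input in a separate column so that 'title' stays unchanged for the
# later experiments; the space keeps the last title word from fusing with the first comment word.
india_data['combined'] = india_data['title'] + ' ' + india_data['comments']

In [52]:
from sklearn.model_selection import train_test_split
X = india_data['combined']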
y = india_data.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) #random state instance will be created by np.random

In [53]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report


logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.65
                  precision    recall  f1-score   support

   Non-Political       0.57      0.50      0.53        26
       Scheduled       0.67      0.43      0.52        28
     Photography       0.82      0.74      0.78        19
        Politics       0.80      1.00      0.89        20
Business/Finance       0.47      0.80      0.59        10
  Policy/Economy       0.94      0.83      0.88        18
          Sports       0.47      0.33      0.39        21
            Food       0.54      0.81      0.65        16
        AskIndia       0.56      0.62      0.59        24
     Coronavirus       0.72      0.72      0.72        18

        accuracy                           0.65       200
       macro avg       0.65      0.68      0.65       200
    weighted avg       0.66      0.65      0.64       200



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
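
The warning above means the solver stopped at its iteration limit before converging. As the message suggests, one option is to raise max_iter. The sketch below shows where the parameter goes in the same kind of pipeline; the value 1000 and the toy corpus are assumptions chosen for illustration, not a tuned setting, and the accuracies reported in this notebook were obtained without it.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Toy corpus with two flairs (made-up examples).
docs = ["election results parliament vote", "cricket match score team",
        "parliament passes new bill", "team wins cricket final"]
labels_toy = ["Politics", "Sports", "Politics", "Sports"]

logreg_demo = Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer()),
                        # max_iter=1000 is an illustrative value.
                        ('clf', LogisticRegression(C=1e5, max_iter=1000))])
logreg_demo.fit(docs, labels_toy)

preds = logreg_demo.predict(["cricket team score"])
# n_iter_ reports how many iterations the solver actually used.
print(preds[0], logreg_demo.named_steps['clf'].n_iter_)
```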


<h3>Logistic Regression Model (Title + Comments + Body) </h3>

In [12]:
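# Build the model input in a separate column so that 'title' is not overwritten
# and each experiment starts from the original cleaned columns.
india_data['combined'] = india_data['title'] + '  ' + india_data['comments'] + '  ' + india_data['body']

In [13]:
from sklearn.model_selection import train_test_split
X = india_data['combined']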
y = india_data.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) #random state instance will be created by np.random

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report


logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


accuracy 0.67
                  precision    recall  f1-score   support

   Non-Political       0.47      0.47      0.47        15
       Scheduled       0.48      0.62      0.54        16
     Photography       0.67      0.70      0.68        23
        Politics       0.82      0.86      0.84        21
Business/Finance       0.61      0.55      0.58        20
  Policy/Economy       0.93      0.88      0.90        16
          Sports       0.69      0.39      0.50        23
            Food       0.67      0.64      0.65        28
        AskIndia       0.58      0.71      0.64        21
     Coronavirus       0.84      0.94      0.89        17

        accuracy                           0.67       200
       macro avg       0.68      0.68      0.67       200
    weighted avg       0.68      0.67      0.67       200



<h3>Logistic Regression Model (Title + Comments + Body + URL) </h3>

In [57]:
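# Build the model input in a separate column so that 'title' is not overwritten
# and each experiment starts from the original cleaned columns.
india_data['combined'] = india_data['title'] + '  ' + india_data['comments'] + '  ' + india_data['body'] + ' ' + india_data['url']

In [58]:
from sklearn.model_selection import train_test_split
X = india_data['combined']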
y = india_data.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) #random state instance will be created by np.random

In [40]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report


logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.775
                  precision    recall  f1-score   support

   Non-Political       0.50      0.43      0.46        21
       Scheduled       0.83      0.68      0.75        22
     Photography       0.89      0.94      0.91        17
        Politics       0.93      0.93      0.93        28
Business/Finance       0.73      0.79      0.76        24
  Policy/Economy       0.93      0.93      0.93        14
          Sports       0.57      0.80      0.67        15
            Food       0.65      0.65      0.65        20
        AskIndia       0.83      0.68      0.75        22
     Coronavirus       0.89      1.00      0.94        17

        accuracy                           0.78       200
       macro avg       0.78      0.78      0.78       200
    weighted avg       0.78      0.78      0.77       200



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<h3>Logistic Regression Model (Title + Comments + URL) </h3>

In [54]:
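# Build the model input in a separate column so that 'title' is not overwritten
# and each experiment starts from the original cleaned columns.
india_data['combined'] = india_data['title'] + '  ' + india_data['comments'] + '  ' + india_data['url']

In [55]:
from sklearn.model_selection import train_test_split
X = india_data['combined']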
y = india_data.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) #random state instance will be created by np.random

In [56]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report


logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


accuracy 0.665
                  precision    recall  f1-score   support

   Non-Political       0.32      0.50      0.39        16
       Scheduled       0.71      0.68      0.69        25
     Photography       0.74      0.74      0.74        19
        Politics       0.80      0.94      0.86        17
Business/Finance       0.52      0.61      0.56        18
  Policy/Economy       0.75      0.79      0.77        19
          Sports       0.72      0.54      0.62        24
            Food       0.67      0.70      0.68        20
        AskIndia       0.56      0.60      0.58        15
     Coronavirus       1.00      0.59      0.74        27

        accuracy                           0.67       200
       macro avg       0.68      0.67      0.66       200
    weighted avg       0.70      0.67      0.67       200



<h3>Logistic Regression Model (Title + URL) </h3>

In [21]:
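# Build the model input in a separate column so that 'title' is not overwritten
# and each experiment starts from the original cleaned columns.
india_data['combined'] = india_data['title'] + '  ' + india_data['url']

In [22]:
from sklearn.model_selection import train_test_split
X = india_data['combined']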
y = india_data.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) #random state instance will be created by np.random

In [23]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report


logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


accuracy 0.61
                  precision    recall  f1-score   support

   Non-Political       0.43      0.50      0.47        20
       Scheduled       0.80      0.57      0.67        21
     Photography       0.71      0.71      0.71        17
        Politics       0.75      0.71      0.73        17
Business/Finance       0.40      0.53      0.45        19
  Policy/Economy       0.83      0.91      0.87        22
          Sports       0.41      0.61      0.49        18
            Food       0.50      0.43      0.46        21
        AskIndia       0.62      0.47      0.53        17
     Coronavirus       0.82      0.64      0.72        28

        accuracy                           0.61       200
       macro avg       0.63      0.61      0.61       200
    weighted avg       0.64      0.61      0.62       200
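
The experiments above differ only in which text columns are joined before training. A minimal sketch of building each combination in a loop, without touching the source columns, could look like this. The toy frame borrows the column names used in this notebook, while feature_sets and the sample strings are hypothetical.

```python
import pandas as pd

# Toy frame with already-cleaned text (made-up values).
toy = pd.DataFrame({
    "title": ["budget vote", "cricket final"],
    "comments": ["good move", "great match"],
    "url": ["examplecom news", "examplecom sports"],
})

# Hypothetical mapping from experiment name to the columns it combines.
feature_sets = {
    "title": ["title"],
    "title+comments": ["title", "comments"],
    "title+comments+url": ["title", "comments", "url"],
}

combined = {}
for name, cols in feature_sets.items():
    # Start from the first column and append the rest, separated by a space.
    text = toy[cols[0]]
    for col in cols[1:]:
        text = text + ' ' + toy[col]
    combined[name] = text

print(combined["title+comments"].tolist())
```

Each entry of combined can then be passed to train_test_split and the pipeline in turn, and the original title column stays available for the next combination.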

