<h1>3 Flair Detector </h1>
<br>
The Logistic Regression classification algorithm is easily generalized to multiple classes. 

In [1]:
import pandas as pd
import os
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report

In [2]:
india_data = pd.read_csv("reddit_india_data.csv")

In [3]:
flairs = india_data.flair.unique()

<b>3.1 Text Pre-Processing:</b> For our data set from reddit, the text cleaning step includes:
    a) remove stop words, 
    b) change text to lower case, 
    c) remove punctuation, 
    d) remove bad characters
  
In the print_val funtion, when an index value between 0 and 999 is given (as understood from 2.1 where we found out the size of the data frame), the title and the flair at that index which are stored into the list "val" will be printed to display the information that will be relevant to the classifier that is, the title and the flair. We are defining this function to recognize what kind of text-preprocessing will be required on our data. </p>

In [4]:
def print_val(index):
    val = india_data[india_data.index == index].values[0]
    print('Title: ', val[1])
    print('Flair:', val[0])

<p><b>3.2 </b>Using the regular expression module, we describe the symbols that need replacement with space and need to be removed. This is done with the sub() function in the re module. In stop_words, we store the stopwords list taken from the natural language toolkit (nltk). </p>

In [5]:
replace_by_space = re.compile('[/(){}\[\]\|@,;]')
bad_symbols = re.compile('[^0-9a-z #+_]')
stop_words = stopwords.words('english')

def clean_data(text):
    #converting to lowercase
    text = text.lower()
    #re.sub(new_value, text_to_processed) 
    text = replace_by_space.sub(' ', text)
    text = bad_symbols.sub('', text)
    #removing the stopwords
    text = ' '.join(word for word in text.split() if word not in stop_words) 
    return text

<p><b>3.2</b> The url has certain values that could negatively impact the model's prediction therefore it needs an extra cleaning step where we remove http, https and www from the string. </p>

In [6]:
def clean_url(u):
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("https://"):
        u = u[8:]
    if u.startswith("www."):
        u = u[4:]
    if u.endswith("/"):
        u = u[:-1]
    return u

<p><b>3.3</b> Cleaning all the values that will be taken into consideration for the classifier. </p>

In [7]:
india_data['title'] = india_data['title'].apply(clean_data)
india_data['comments'] = india_data['comments'].astype('str').apply(clean_data)
india_data['body'] = india_data['body'].astype('str').apply(clean_data)

india_data['url'] = india_data['url'].apply(clean_url)
india_data['url'] = india_data['url'].apply(clean_data)

<h3> Logistic Regression Model (Title) </h3>

In [8]:
X = india_data.title
y = india_data.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

In [9]:
logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.77
                  precision    recall  f1-score   support

   Non-Political       0.50      0.80      0.62         5
       Scheduled       0.50      0.50      0.50         8
     Photography       1.00      0.85      0.92        13
        Politics       0.91      0.83      0.87        12
Business/Finance       0.94      1.00      0.97        15
  Policy/Economy       0.73      1.00      0.84         8
          Sports       0.60      0.46      0.52        13
            Food       1.00      0.67      0.80         9
        AskIndia       0.57      0.89      0.70         9
     Coronavirus       1.00      0.62      0.77         8

        accuracy                           0.77       100
       macro avg       0.77      0.76      0.75       100
    weighted avg       0.80      0.77      0.77       100



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<h3>Logistic Regression Model (Title + Comments) </h3>

In [10]:
india_data['title'] = india_data['title'] + ' ' + india_data['comments']

In [11]:
X = india_data.title
y = india_data.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

In [12]:
logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.72
                  precision    recall  f1-score   support

   Non-Political       0.70      0.64      0.67        11
       Scheduled       0.50      0.62      0.56         8
     Photography       0.78      0.78      0.78         9
        Politics       0.86      0.92      0.89        13
Business/Finance       0.50      0.14      0.22         7
  Policy/Economy       0.64      0.88      0.74         8
          Sports       0.75      0.69      0.72        13
            Food       0.55      0.75      0.63         8
        AskIndia       0.78      0.78      0.78         9
     Coronavirus       0.92      0.79      0.85        14

        accuracy                           0.72       100
       macro avg       0.70      0.70      0.68       100
    weighted avg       0.72      0.72      0.71       100



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<h3>Logistic Regression Model (Title + Comments + Body) </h3>

In [13]:
india_data['title'] = india_data['title'] + '  ' + india_data['comments'] + '  ' + india_data['body']

In [14]:
X = india_data.title
y = india_data.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

In [15]:
logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.63
                  precision    recall  f1-score   support

   Non-Political       0.27      0.33      0.30         9
       Scheduled       0.77      0.62      0.69        16
     Photography       0.73      0.73      0.73        11
        Politics       0.50      0.67      0.57         6
Business/Finance       0.67      0.40      0.50        10
  Policy/Economy       1.00      1.00      1.00         8
          Sports       0.75      0.67      0.71         9
            Food       0.47      0.70      0.56        10
        AskIndia       0.62      0.62      0.62        13
     Coronavirus       0.71      0.62      0.67         8

        accuracy                           0.63       100
       macro avg       0.65      0.64      0.63       100
    weighted avg       0.66      0.63      0.63       100



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<h3>Logistic Regression Model (Title + Comments + Body + URL) </h3>

In [16]:
india_data['title'] = india_data['title'] + '  ' + india_data['comments'] + '  ' + india_data['body'] + ' ' + india_data['url']

In [17]:
X = india_data.title
y = india_data.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

In [18]:
logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.66
                  precision    recall  f1-score   support

   Non-Political       0.64      0.64      0.64        11
       Scheduled       0.88      0.50      0.64        14
     Photography       0.78      0.58      0.67        12
        Politics       1.00      0.80      0.89        10
Business/Finance       0.50      0.30      0.37        10
  Policy/Economy       0.78      1.00      0.88         7
          Sports       0.62      1.00      0.77        10
            Food       0.43      0.86      0.57         7
        AskIndia       0.50      0.50      0.50        10
     Coronavirus       0.67      0.67      0.67         9

        accuracy                           0.66       100
       macro avg       0.68      0.68      0.66       100
    weighted avg       0.69      0.66      0.65       100



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<h3>Logistic Regression Model (Title + Comments + URL) </h3>

In [19]:
india_data['title'] = india_data['title'] + '  ' + india_data['comments'] + '  ' + india_data['url']

In [20]:
X = india_data.title
y = india_data.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

In [21]:
logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

accuracy 0.61
                  precision    recall  f1-score   support

   Non-Political       0.00      0.00      0.00         6
       Scheduled       0.54      0.64      0.58        11
     Photography       0.82      0.75      0.78        12
        Politics       0.80      0.67      0.73        12
Business/Finance       0.40      0.22      0.29         9
  Policy/Economy       0.86      0.86      0.86         7
          Sports       0.46      0.55      0.50        11
            Food       0.60      0.75      0.67        12
        AskIndia       0.67      0.80      0.73        10
     Coronavirus       0.60      0.60      0.60        10

        accuracy                           0.61       100
       macro avg       0.57      0.58      0.57       100
    weighted avg       0.60      0.61      0.60       100



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


<h3>Logistic Regression Model (Title + URL) </h3>

In [22]:
india_data['title'] = india_data['title'] + '  ' + india_data['url']

In [23]:
X = india_data.title
y = india_data.flair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) 

In [24]:
logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=flairs))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


accuracy 0.66
                  precision    recall  f1-score   support

   Non-Political       0.60      0.38      0.46         8
       Scheduled       0.69      0.75      0.72        12
     Photography       0.90      0.60      0.72        15
        Politics       1.00      0.88      0.93         8
Business/Finance       0.33      0.44      0.38         9
  Policy/Economy       0.78      0.88      0.82         8
          Sports       0.60      0.46      0.52        13
            Food       0.50      1.00      0.67         7
        AskIndia       0.62      0.45      0.53        11
     Coronavirus       0.75      1.00      0.86         9

        accuracy                           0.66       100
       macro avg       0.68      0.68      0.66       100
    weighted avg       0.69      0.66      0.66       100

