## Building the Model

## Tasks

- Test and record accuracy with:
  - N-grams
  - TfIdf
  - Random Forest
  - Naive Bayes

## Outcomes

- Tables comparing Accuracy Scores 

- Visualisations to compare Accuracy Scores

------

### Import Pandas and Set Options

In [1]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
# None means "no limit"; the old -1 value is deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)

### Read the data

In [2]:
df = pd.read_csv('football_sentiment_training_data.csv', engine="python")

In [3]:
len(df)

1050000

### Remove Rows with Missing Values

In [4]:
df = df.dropna()

### Add Numeric Columns to represent sentiment
- 0 = Negative
- 1 = Neutral
- 2 = Positive

In [5]:
def sentiment_numbers(x):
    # Map a sentiment label to its numeric code
    if str(x) == "NEGATIVE":
        return 0
    elif str(x) == "NEUTRAL":
        return 1
    elif str(x) == "POSITIVE":
        return 2
    # Any other label falls through and returns None

In [6]:
df['sentiment_number'] = df['sentiment'].apply(sentiment_numbers)
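
The same encoding can also be written as a dictionary lookup with `Series.map`. This is only a sketch on a toy frame: `SENTIMENT_CODES` and `toy` are hypothetical names, not part of the notebook. Unlike the function above, labels outside the mapping come back as `NaN`, which makes unexpected values easy to spot.

```python
import pandas as pd

# Hypothetical mapping; the labels and codes mirror the list above.
SENTIMENT_CODES = {"NEGATIVE": 0, "NEUTRAL": 1, "POSITIVE": 2}

# Toy frame standing in for the real training data.
toy = pd.DataFrame({"sentiment": ["NEGATIVE", "POSITIVE", "NEUTRAL", "UNKNOWN"]})

# Labels missing from the dict become NaN instead of raising an error.
toy["sentiment_number"] = toy["sentiment"].map(SENTIMENT_CODES)
print(toy)
```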

-----

### Vectorizer

Try out different vectorisation methods here including *TfIdf*

Try out different n-gram ranges (the `ngram_range` parameter)

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [270]:
# Declare Vectorizer
vect = TfidfVectorizer(binary=True, ngram_range=(3,3))

In [272]:
# Fit Vectorizer
vect.fit(df['clean'])

TfidfVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(3, 3), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [273]:
# Transform the cleaned text into a feature matrix
X = vect.transform(df['clean'])
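
To get a feel for how the vectoriser choice and n-gram range change the feature space, the settings can be compared side by side. The sketch below uses a three-sentence toy corpus in place of `df['clean']`; the `settings` dictionary and its keys are hypothetical, chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for df['clean'].
corpus = [
    "great goal from the striker",
    "poor defending from the home side",
    "the striker missed a great chance",
]

# Hypothetical settings to compare; the last one matches the cell above.
settings = {
    "count_1_1": CountVectorizer(ngram_range=(1, 1)),
    "count_1_2": CountVectorizer(ngram_range=(1, 2)),
    "tfidf_1_2": TfidfVectorizer(ngram_range=(1, 2)),
    "tfidf_3_3": TfidfVectorizer(binary=True, ngram_range=(3, 3)),
}

# Fit each vectoriser and record the (documents, features) shape.
shapes = {}
for name, vectorizer in settings.items():
    matrix = vectorizer.fit_transform(corpus)
    shapes[name] = matrix.shape
    print(name, matrix.shape)
```

Wider n-gram ranges add features quickly, so on the full dataset memory use and fit time are likely to grow with them.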

-----

### Build the Classifier 

This run uses a `LinearSVC` model, stored in a variable named `lr`; other classifiers still need to be tested

In [274]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

#### Declare target / labels

In [275]:
target = df['sentiment']

#### Split Train | Test Data

In [276]:
# Test size = 0.25 | Train size = 0.75
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size = 0.25, random_state=123)

#### Find Optimal C Value for LinearSVC

In [278]:
for c in [1]:
    lr = LinearSVC(C=c)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_test, lr.predict(X_test))))

Accuracy for C=1: 0.6108322566629698
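
The loop above only tries `C=1`. A sweep over several values might look like the sketch below. It runs on synthetic data from `make_classification` rather than the TF-IDF matrix, and the grid of `C` values is an assumption, not something taken from this notebook. Picking `C` on the test set, as done here for brevity, leaks information; a separate validation split or cross-validation would be cleaner.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic three-class data standing in for the TF-IDF features and labels.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=123
)

# Hypothetical grid of C values; the useful range depends on the data.
scores = {}
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    clf = LinearSVC(C=c, max_iter=10000)
    clf.fit(X_tr, y_tr)
    scores[c] = accuracy_score(y_te, clf.predict(X_te))
    print("Accuracy for C=%s: %s" % (c, scores[c]))

# Keep the C with the highest held-out accuracy.
best_c = max(scores, key=scores.get)
print("Best C:", best_c)
```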


#### Build model using optimal C

In [279]:
lr = LinearSVC(C=0.5)

In [280]:
# Fitting the LinearSVC model on training data - takes time!
lr.fit(X_train, y_train)

LinearSVC(C=0.5, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

#### Print Accuracy Score

In [281]:
print("Accuracy Score: {:.2f}".format(lr.score(X_test, y_test)))

Accuracy Score: 0.61
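
For the comparison tables listed under Outcomes, one option is to loop over several classifiers and collect their accuracy in a DataFrame. The sketch below is illustrative only: the toy corpus, the `models` dictionary and the hyperparameters are assumptions, and the scores it prints say nothing about the real dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy labelled corpus standing in for df['clean'] and df['sentiment'].
positive = ["brilliant goal and superb win", "great save and excellent finish",
            "fantastic performance from the striker", "wonderful pass and clinical goal"]
negative = ["terrible defending and poor loss", "awful mistake from the keeper",
            "dreadful performance and sloppy passing", "horrible miss and weak finish"]
neutral = ["the match kicks off at three", "the squad travels on friday",
           "the manager named the lineup", "the game is played at home"]
texts = (positive + negative + neutral) * 5
labels = (["POSITIVE"] * 4 + ["NEGATIVE"] * 4 + ["NEUTRAL"] * 4) * 5

# Vectorise, then split, in the same order as the notebook.
X_all = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, labels, test_size=0.25, random_state=123, stratify=labels
)

# Hypothetical set of classifiers to compare.
models = {
    "LinearSVC": LinearSVC(C=0.5),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=123),
}

# Fit each model and record its held-out accuracy.
rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({"classifier": name, "accuracy": accuracy_score(y_te, model.predict(X_te))})

results = pd.DataFrame(rows, columns=["classifier", "accuracy"])
print(results.sort_values("accuracy", ascending=False))
```

With matplotlib installed, `results.plot.bar(x="classifier", y="accuracy")` would turn the same table into a quick visual comparison.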


#### Confusion matrix / Precision Scores

In [282]:
from sklearn.metrics import classification_report

In [283]:
# Predicted Values
y_pred = lr.predict(X_test)

In [284]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    NEGATIVE       0.67      0.59      0.63     87204
     NEUTRAL       0.51      0.70      0.59     87986
    POSITIVE       0.73      0.55      0.63     87193

   micro avg       0.61      0.61      0.61    262383
   macro avg       0.64      0.61      0.61    262383
weighted avg       0.64      0.61      0.61    262383
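
The report above shows NEUTRAL with the highest recall but the lowest precision, which suggests other classes are often predicted as neutral. A confusion matrix shows where the errors go. `confusion_matrix` is already imported earlier in the notebook; the sketch below runs it on a handful of made-up labels (`y_true_toy`, `y_pred_toy`) rather than the real `y_test` and `y_pred`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["NEGATIVE", "NEUTRAL", "POSITIVE"]

# Made-up labels standing in for y_test and y_pred.
y_true_toy = ["NEGATIVE", "NEGATIVE", "NEUTRAL", "NEUTRAL", "POSITIVE", "POSITIVE"]
y_pred_toy = ["NEGATIVE", "NEUTRAL", "NEUTRAL", "NEUTRAL", "POSITIVE", "NEUTRAL"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)

# Wrap in a DataFrame so the axes are labelled when printed.
cm_table = pd.DataFrame(
    cm,
    index=["true_" + name for name in labels],
    columns=["pred_" + name for name in labels],
)
print(cm_table)
```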



-----


## Save and Export Model & Vectorizer

In [286]:
import pickle

#### Model

In [287]:
# Specify Filename
filename = 'SVC_tfidf_3_3.sav'

In [288]:
# Open new file
outfile = open(filename,'wb')

In [289]:
# Dump model into new file
pickle.dump(lr, outfile)

In [290]:
# Close the file
outfile.close()

#### Vectorizer

In [135]:
# Specify Filename
#filename_vect = "tfidf_3_3.sav"

In [136]:
# Open new file
#outfile_vect = open(filename_vect,'wb')

In [137]:
# Dump vectorizer into new file
#pickle.dump(vect, outfile_vect)

In [138]:
# Close the file
#outfile_vect.close()
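
The model and vectoriser need to be saved as a pair, since predictions on new text only make sense with the vocabulary the model was trained on. A `with` block closes each file automatically, which avoids closing the wrong handle by mistake. The sketch below assumes toy stand-ins (`toy_model`, `toy_vect`) and writes to a temporary directory, so the file names are placeholders rather than the ones used above.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the fitted vectoriser and model.
toy_texts = ["great goal", "poor defending", "great save", "poor pass"]
toy_labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]
toy_vect = TfidfVectorizer().fit(toy_texts)
toy_model = LinearSVC(C=0.5).fit(toy_vect.transform(toy_texts), toy_labels)

# Temporary directory so the sketch never overwrites real files.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "model.sav")
vect_path = os.path.join(out_dir, "vectorizer.sav")

# Each file is closed automatically when its with-block ends.
for obj, path in [(toy_model, model_path), (toy_vect, vect_path)]:
    with open(path, "wb") as outfile:
        pickle.dump(obj, outfile)

# Reload both objects and use them together.
with open(model_path, "rb") as infile:
    loaded_model = pickle.load(infile)
with open(vect_path, "rb") as infile:
    loaded_vect = pickle.load(infile)

print(loaded_model.predict(loaded_vect.transform(["great pass"])))
```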

------


### Define Function to Predict Sentiment of Unseen Data

In [113]:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer 

In [114]:
import stopwords

In [115]:
stop_words = stopwords.stopwords()

In [126]:
def sentiment_prob(text):
    # Score each sentence of `text` using the fitted vectorizer (vect) and model (lr)

    # Remove stopwords and lemmatize the remaining words
    lemmatizer = WordNetLemmatizer()
    lemmatized_result = " ".join([lemmatizer.lemmatize(word) for word in word_tokenize(text) if word not in stop_words])

    # Split the lemmatized text into sentences
    lemmatized_tokens = sent_tokenize(lemmatized_result)

    # Decision-function scores per class (signed margins, not true probabilities)
    predictions = pd.DataFrame(lr.decision_function(vect.transform(lemmatized_tokens)))

    # Attach the sentence text
    predictions['sentences'] = lemmatized_tokens

    # Rename cols; classes are ordered alphabetically: NEGATIVE, NEUTRAL, POSITIVE
    predictions = predictions.rename(columns={0: "negative_prob", 1: "neutral_prob", 2: "positive_prob"})

    # Threshold intended for probability outputs; not applied to the LinearSVC scores below
    threshold = 0.55

    # Sentiment list
    sentiment_list = []

    # Pick the class with the highest score; anything else (including ties) is neutral
    # Need to change again for Logistic Regression
    for index, row in predictions.iterrows():
        if row['negative_prob'] > row['positive_prob'] and row['negative_prob'] > row['neutral_prob']:
            sentiment_list.append("negative")
        elif row['positive_prob'] > row['negative_prob'] and row['positive_prob'] > row['neutral_prob']:
            sentiment_list.append("positive")
        else:
            sentiment_list.append("neutral")

    predictions['sentiment'] = sentiment_list

    return predictions
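
The function declares a 0.55 threshold but never applies it, and `LinearSVC` only exposes `decision_function` scores, not probabilities. If the model is switched to one with `predict_proba` (Logistic Regression, for example), the threshold could be applied roughly as below. This is a sketch under that assumption: `label_with_threshold` is a hypothetical helper and the probability rows are made up.

```python
import numpy as np


def label_with_threshold(probabilities, classes, threshold=0.55, fallback="neutral"):
    """Return the most likely class per row, or `fallback` when confidence is low."""
    probabilities = np.asarray(probabilities, dtype=float)
    best = probabilities.argmax(axis=1)                 # index of the top class per row
    confident = probabilities.max(axis=1) >= threshold  # True where the top class is sure enough
    return [classes[i] if ok else fallback for i, ok in zip(best, confident)]


# Made-up probability rows in the order negative, neutral, positive.
classes = ["negative", "neutral", "positive"]
probs = [
    [0.70, 0.20, 0.10],
    [0.40, 0.35, 0.25],
    [0.10, 0.15, 0.75],
]
toy_sentiments = label_with_threshold(probs, classes)
print(toy_sentiments)  # ['negative', 'neutral', 'positive']
```

The second row falls back to neutral because its top probability (0.40) is below the threshold, even though negative is the largest value.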

In [121]:
sentiment(""" It's fair to say the signing of Ighalo wasn't the most popular in United's history at the time, but the Nigerian has gone on to prove an awful lot of people wrong.

The 30-year-old grabbed his fourth goal in United colours, and displayed some delightfully silky footwork to take the ball past the defender before thumping a strike into the top corner.

Ighalo is proving that he is much more than the target man to lump balls up to that some people had suggested, and if he continues this rich vein of form he could end up being one of the most shrewd signings of the whole season.r the Bundesliga title is likely to return to its perennial destination.
 """)

| | 0 | sentences |
|---|---|---|
| 0 | NEGATIVE | It's fair to say the signing of Ighalo wasn't the most popular in United's history at the time, but the Nigerian has gone on to prove an awful lot of people wrong. |
| 1 | NEUTRAL | The 30-year-old grabbed his fourth goal in United colours, and displayed some delightfully silky footwork to take the ball past the defender before thumping a strike into the top corner. |
| 2 | NEUTRAL | Ighalo is proving that he is much more than the target man to lump balls up to that some people had suggested, and if he continues this rich vein of form he could end up being one of the most shrewd signings of the whole season.r the Bundesliga title is likely to return to its perennial destination. |


In [128]:
sentiment_prob(""""Early goals from Mason Mount and Pedro put Chelsea into a commanding lead over Everton which Carlo Ancelotti's men never recovered from, Chelsea made light work of Everton in a dominant victory at Stamford Bridge on Sunday. ,  The Blues took a 14th-minute lead after Mason Mount controlled Pedro's pass on the turn before firing hard and low past Jordan Pickford at the near post. ,  Soon after Pedro had got his own name on the scoresheet, racing clear of the Toffees' defence to finish past Pickford and double the advantage. , Willian then put the game beyond doubt just five minutes into the second half as he drilled a 25-yard shot into the far corner. The Brazilian then turned provider with a curling cross to the far post which Olivier Giroud tucked away, as Everton sank without trace. Here are John Cross's ratings from Stamford Bridge. Arrizabalaga 7, Back in the starting line-up, looked good. Does he have a future, after all?, Azpilicueta 6, Solid in defence, did not offer so much going forward but reliable. Rudiger 7, Rock solid, used his pace well and kept Everton’s forwards quiet. Zouma 7, Surprise pick, had an early wobble but looked strong from then on. Alonso 7, Had a good game, linked up and overlapped nicely from left back. Barkley 8, Big performance against old club. Two assists and had last laugh. Gilmour 7, When Roy Keane calls you world class, you must be special. He is. Mount 7, Well taken goal, back to his lively best before going off injured. Pedro 8, MotM. His best game for ages, well taken goal and was excellent overall. Willian 7, Electric pace, scored and dangerous. Proving Chelsea need to keep him. Giroud 7, Scored but also provided a focal point for Chelsea’s attack. Excellent. Subs, James, for Mount, 60 mins, 6, Anjorin, fo Willian, 71 mins, Broja, for Giroud, 87 mins, Pickford 7, Some good saves, Everton would have lost by double figures without him. Sidibe 4, Had a stinker, sloppy and undone. 
Very uncomfortable at right back. Keane 5, Got outdone by Giroud’s physical strength. Bad day at the office. Holgate 4, Horribly exposed, poor positioning and lack of pace. Nightmare. Digne 5, Has been struggling with injury and did not look at his sharpest. Bernard 4, Hooked at half time. Did get an injury, but could have been subbed anyway. Gomes 5, Given the runaround, could not get close to his Chelsea opponents. Davies 5, Subbed early after a nightmare performance where he was just chasing shadows. Sigurdsson 5, Easy to forget he was even playing. The game completely passed him by. Calvert-Lewin 5, Went clean through but fired wide with a chance to get Everton back in it. Richarlison 5, Barely had a look-in, no service and could not provide an outlet. Subs, Walcott, for Bernard, 46 mins, 5, Kean, for Davies, 57 mins, 5, Gordon, for Calvert-Lewin, 76 mins" """)

| | negative_prob | neutral_prob | positive_prob | sentences | sentiment |
|---|---|---|---|---|---|
| 0 | -3.221297 | 1.023393 | 0.021843 | `` Early goal from Mason Mount Pedro Chelsea commanding lead Everton Carlo Ancelotti 's men recovered from , Chelsea made light work Everton dominant victory Stamford Bridge Sunday . | neutral |
| 1 | -0.603206 | 0.496868 | -1.353648 | , The Blues took 14th-minute lead after Mason Mount controlled Pedro 's pas turn firing hard low past Jordan Pickford near post . | neutral |
| 2 | -1.901474 | 0.513729 | 0.153273 | , Soon after Pedro got his scoresheet , racing clear Toffees ' defence finish past Pickford double advantage . | neutral |
| 3 | -0.308802 | 0.207838 | -0.929148 | , Willian game doubt just five minute second half drilled 25-yard shot far corner . | neutral |
| 4 | -1.549574 | 1.060306 | -0.9617 | The Brazilian turned provider with curling cross far post Olivier Giroud tucked away , Everton sank trace . | neutral |
| 5 | -1.362039 | 1.384647 | -1.579144 | Here John Cross 's rating from Stamford Bridge . | neutral |
| 6 | -0.922641 | -0.458226 | 0.325058 | Arrizabalaga 7 , Back starting line-up , looked good . | positive |
| 7 | -0.629636 | 0.231492 | -0.609062 | Does future , after all ? | neutral |
| 8 | -0.557152 | -0.9694 | 0.325402 | , Azpilicueta 6 , Solid defence , did not offer much going forward reliable . | positive |
| 9 | -2.017886 | -0.000682 | 0.637651 | Rudiger 7 , Rock solid , used his pace kept Everton ’ s forward quiet . | positive |
