# Modeling
This notebook continues the NLP steps using the text data that was saved to a Feather file in the previous notebook (each row holds a file name and its corresponding text). It covers the making of the baseline model and the model improvements that follow.

In [1]:
# Data Importing and Manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statistics import mean

#Sentiment Analysis
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# Tokenizing
from textblob.taggers import NLTKTagger

# Time series handling 
import datetime


# Cleaning up memory on computer after running code
# (needed below: gc.collect() is called right after the dataframe is loaded)
import gc; gc.enable()

# import string

# # String handling and NLP model
# from sklearn.feature_extraction.text import TfidfVectorizer
# import nltk
# from nltk import word_tokenize
# from nltk.tokenize import treebank
# # nltk.download('opinion_lexicon')
# # nltk.download('vader_lexicon')
# # nltk.download('punkt')
# from nltk.corpus import opinion_lexicon

# from sklearn.model_selection import train_test_split

In [2]:
# Reading from 'Feather' format
df = pd.read_feather('/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/CSV_Files/10-Qs.feather')
del df['index']; gc.collect()
df.head()

Unnamed: 0,file_name,text
0,0000092122-17-000024.txt,txt hdrsgml access number conform submiss type...
1,0000092122-18-000050.txt,txt hdrsgml access number conform submiss type...
2,0000092122-19-000016.txt,txt hdrsgml access number conform submiss type...
3,0000092122-18-000027.txt,txt hdrsgml access number conform submiss type...
4,0000092122-16-000213.txt,txt hdrsgml access number conform submiss type...


In [3]:
df.shape # Viewing the shape of the df

(13, 2)

In [157]:
df['Year'] = [2017, 2018, 2019, 2018, 2016, 2020, 2017, 2016, 2016, 2019, 2017, 2019, 2018]
# How to find a document's year: characters 11-12 of file_name hold the two-digit year
df['file_name'][7][11:13]

'16'

# Next Steps
While conducting online research I came across a YouTube video that walked through the TextBlob library. In the previous notebook we were able to remove punctuation, stopwords, and digits (we could not parse the documents any further because of the time frame we have; this can be done in the near future). We were also able to normalize the text and create a dataframe from the normalized text and its corresponding file name. Below you will see my attempt to perform a sentiment analysis on the text data in the df we currently have. The documentation for this library is located in the Resources section.

# Test Run
**Let's try using the library on one body of text from the df.**

In [6]:
# Create Textblob obj to be used for analysis
obj = TextBlob(df.text[10]) 

In [121]:
# obj.sentiment returns polarity (between -1 & 1) and subjectivity (between 0 & 1)
# Using the initiated obj to get the sentiment values
sentiment = obj.sentiment

In [122]:
print(sentiment) # Printing results 

Sentiment(polarity=0.014091192227164666, subjectivity=0.23202066756152523)


# Turn It Into A Function
It worked! Now let's try this for every text body in the df by writing a function. The cells below contain two functions: one returns the list of sentiment polarity values for each body of text in df[*text*], and the other prints whether each text is negative, positive, or neutral.

In [124]:
def sentiment_ana(df):
    '''Take the df as an argument, iterate through every body of text
    in the text column, and return a list of sentiment polarity values,
    one for each body of text.'''
    sentiments_list = []
    # Iterate over every row instead of a hard-coded list of 13 indexes
    for text in df['text']:
        obj = TextBlob(text)
        sentiment = obj.sentiment.polarity
        sentiments_list.append(sentiment)
    return sentiments_list

In [125]:
SO_Sentiments = sentiment_ana(df) # Calling function and assigning to variable

In [126]:
SO_Sentiments # Viewing list of polarity values 

[0.025743580469827186,
 -0.0033127985425172066,
 0.03165365439445174,
 0.02479712515096895,
 0.019257323182357684,
 0.018702514236993255,
 0.014686811941619592,
 0.030788767127009578,
 0.024793322909917113,
 0.047629955462548425,
 0.014091192227164666,
 0.024803428417207694,
 0.01055402079920463]

# Quick Observation
Now that we have a list of floats, one for each body of text in the df (all 13 10-Qs for Southern Co), let's find the average to get an idea of the overall sentiment of the documents. The mean polarity comes out slightly above zero (about 0.022 on a scale from -1 to 1), so the overall language of the documents reads as mildly positive. But we need a model to predict the next document's sentiment!

In [102]:
SA_Mean = mean(SO_Sentiments)
if SA_Mean == 0: 
    print(SA_Mean)
    print('The text Is Neutral')
    print('\n') 
elif SA_Mean > 0:
    print(SA_Mean)
    print('The Text Is Positive')
    print('\n') 
else:
    print(SA_Mean)
    print('The Text Is Negative')
    print('\n') 

0.02186068444436564
The Text Is Positive




# Is It Positive, Neutral or Negative

In [72]:
def SA(_list):
    '''Take the list of polarity values as an argument, iterate through it,
    check which condition each value meets, and print the label that
    matches that condition.'''
    for sentiment in _list:
        if sentiment == 0: 
            print(sentiment)
            print('The text Is Neutral')
            print('\n') 
        elif sentiment > 0:
            print(sentiment)
            print('The Text Is Positive')
            print('\n') 
        else:
            print(sentiment)
            print('The Text Is Negative')
            print('\n') 

In [76]:
Sentiment_Results = SA(SO_Sentiments) # SA only prints its results, so Sentiment_Results is None

0.025743580469827186
The Text Is Positive


-0.0033127985425172066
The Text Is Negative


0.03165365439445174
The Text Is Positive


0.02479712515096895
The Text Is Positive


0.019257323182357684
The Text Is Positive


0.018702514236993255
The Text Is Positive


0.014686811941619592
The Text Is Positive


0.030788767127009578
The Text Is Positive


0.024793322909917113
The Text Is Positive


0.047629955462548425
The Text Is Positive


0.014091192227164666
The Text Is Positive


0.024803428417207694
The Text Is Positive


0.01055402079920463
The Text Is Positive
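
The exact `== 0` check in `SA` only fires when a polarity is precisely zero, which is unlikely for long documents, and one of the values above sits very close to zero. A minimal sketch of an alternative that returns labels instead of printing them and treats a small band around zero as neutral is shown here. The function name `label_polarity` and the `neutral_band` value of 0.005 are assumptions chosen for illustration, not part of the original notebook.

```python
def label_polarity(polarity, neutral_band=0.005):
    """Return 'Positive', 'Negative', or 'Neutral' for one polarity value.

    neutral_band is a hypothetical tolerance: values whose magnitude is at
    or below it are treated as neutral rather than positive or negative.
    """
    if abs(polarity) <= neutral_band:
        return "Neutral"
    return "Positive" if polarity > 0 else "Negative"


# First three polarity values from the list above
sample = [0.025743580469827186, -0.0033127985425172066, 0.03165365439445174]
labels = [label_polarity(p) for p in sample]
print(labels)  # ['Positive', 'Neutral', 'Positive']
```

Returning the labels keeps them available for later steps, for example as a new column in the df.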




# Test Run 2
Looking into the advanced usage of TextBlob, I was able to find code that can do the sentiment analysis of the text for me in just a few lines. This code was taken from the TextBlob documentation. The link to this code is provided in the Resources section of the notebook. Let's test out this code ...

In [114]:
# NLTK classifier trained on a movie reviews corpus.
# Using trained NB Analyzer to do sentiment analysis on a body of text
blob = TextBlob(df.text[0], analyzer=NaiveBayesAnalyzer())
blob.sentiment

Sentiment(classification='pos', p_pos=1.0, p_neg=7.994055237498093e-59)

In [134]:
SA_p2 = sentiment_ana2(df)
SA_p2
# polarity is how positive or negative the text is, between -1 & 1
# subjectivity is how opinionated the text seems to be, between 0 & 1

[Sentiment(polarity=0.025743580469827186, subjectivity=0.2540728170602351),
 Sentiment(polarity=-0.0033127985425172066, subjectivity=0.21546182328981334),
 Sentiment(polarity=0.03165365439445174, subjectivity=0.2412985417005685),
 Sentiment(polarity=0.02479712515096895, subjectivity=0.24103082251213098),
 Sentiment(polarity=0.019257323182357684, subjectivity=0.24951626760818546),
 Sentiment(polarity=0.018702514236993255, subjectivity=0.3198025500406785),
 Sentiment(polarity=0.014686811941619592, subjectivity=0.2394814773079978),
 Sentiment(polarity=0.030788767127009578, subjectivity=0.2808617455424036),
 Sentiment(polarity=0.024793322909917113, subjectivity=0.26371509564186707),
 Sentiment(polarity=0.047629955462548425, subjectivity=0.29089749264666365),
 Sentiment(polarity=0.014091192227164666, subjectivity=0.23202066756152523),
 Sentiment(polarity=0.024803428417207694, subjectivity=0.22793639634486648),
 Sentiment(polarity=0.01055402079920463, subjectivity=0.24379420524338227)]

In [141]:
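def sentiment_ana3(df):
    '''Return the Naive Bayes sentiment (classification, p_pos, p_neg)
    for each body of text in the text column.'''
    sentiments_list = []
    # Create the analyzer once so it can be reused for every document
    analyzer = NaiveBayesAnalyzer()
    for text in df['text']:
        blob = TextBlob(text, analyzer=analyzer)
        sentiments = blob.sentiment
        sentiments_list.append(sentiments)
    return sentiments_list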

In [142]:
SA_p3 = sentiment_ana3(df)

In [202]:
SA_p3

[Sentiment(classification='pos', p_pos=1.0, p_neg=7.994055237498093e-59),
 Sentiment(classification='pos', p_pos=1.0, p_neg=1.9328015799642474e-54),
 Sentiment(classification='pos', p_pos=1.0, p_neg=8.681604915330248e-51),
 Sentiment(classification='pos', p_pos=1.0, p_neg=8.332839280044437e-45),
 Sentiment(classification='pos', p_pos=1.0, p_neg=7.02691649602693e-62),
 Sentiment(classification='pos', p_pos=1.0, p_neg=9.397962013413978e-46),
 Sentiment(classification='pos', p_pos=1.0, p_neg=4.991106707551972e-91),
 Sentiment(classification='pos', p_pos=1.0, p_neg=1.9355736723606737e-73),
 Sentiment(classification='pos', p_pos=1.0, p_neg=2.035307026919699e-60),
 Sentiment(classification='pos', p_pos=1.0, p_neg=2.438798193514326e-54),
 Sentiment(classification='pos', p_pos=1.0, p_neg=4.995200663023715e-50),
 Sentiment(classification='pos', p_pos=1.0, p_neg=2.68729553663566e-53),
 Sentiment(classification='pos', p_pos=1.0, p_neg=3.896515836791994e-65)]
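
Every document comes back as `pos` with `p_pos=1.0`, which says more about the classifier than about the filings. One plausible explanation is that Naive Bayes combines evidence from every word it sees, so across a very long document even a tiny per-word lean toward one class compounds into near certainty. The sketch below illustrates that effect with made-up numbers: the per-word likelihood ratio of 1.05 and the word counts are assumptions for illustration, and the function is not TextBlob's implementation.

```python
import math


def posterior_pos(log_ratio_per_word, n_words):
    """Posterior P(pos) under equal priors when each of n_words contributes
    the same log-likelihood ratio log(P(word|pos) / P(word|neg))."""
    total = log_ratio_per_word * n_words
    # Numerically stable logistic function
    if total >= 0:
        return 1.0 / (1.0 + math.exp(-total))
    e = math.exp(total)
    return e / (1.0 + e)


slight_lean = math.log(1.05)  # hypothetical: each word is 5% likelier under 'pos'
for n in (1, 100, 3000):
    print(n, posterior_pos(slight_lean, n))
```

With one word the posterior stays close to 0.5, but with thousands of words it rounds to 1.0 in floating point. Since the analyzer was trained on movie reviews, its scores on 10-Q filings are best read with caution.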

# Feature Engineering

In [185]:
# Creating list of the years the files are from (the year is embedded in the file_name column)
df['Year'] = [2017, 2018, 2019, 2018, 2016, 2020, 2017, 2016, 2016, 2019, 2017, 2019, 2018]
# How to find a document's year (two-digit form):
# df['file_name'][7][11:13]
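
The year list above is typed by hand, but the same information is carried in characters 11-12 of each file name. A minimal sketch of deriving it programmatically follows. It assumes every file name has the layout seen in this df (10 digits, a two-digit year, 6 digits, then `.txt`) and that all years fall in the 2000s. The helper name `year_from_file_name` is hypothetical.

```python
import re

# Assumed layout, based on the file names in this df:
# 10-digit prefix, 2-digit year, 6-digit sequence number, '.txt'
FILE_NAME_RE = re.compile(r"^(\d{10})-(\d{2})-(\d{6})\.txt$")


def year_from_file_name(file_name, century=2000):
    """Return the four-digit year encoded in a file name like
    '0000092122-16-000144.txt'."""
    match = FILE_NAME_RE.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected file name format: {file_name!r}")
    return century + int(match.group(2))


file_names = ["0000092122-17-000024.txt",
              "0000092122-18-000050.txt",
              "0000092122-16-000144.txt"]
years = [year_from_file_name(name) for name in file_names]
print(years)  # [2017, 2018, 2016]
```

Applied to the dataframe, this could look like `df['Year'] = df['file_name'].apply(year_from_file_name)`, which stays correct if rows are added or reordered.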

In [187]:
# Code used below came from Alice Zhao; the link to her GitHub is in the Resources section
# Code used to calculate the sentiment polarity and subjectivity values
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

df['polarity'] = df['text'].apply(pol)
df['subjectivity'] = df['text'].apply(sub)
df

Unnamed: 0,file_name,text,Year,polarity,subjectivity
0,0000092122-17-000024.txt,txt hdrsgml access number conform submiss type...,2017,0.025744,0.254073
1,0000092122-18-000050.txt,txt hdrsgml access number conform submiss type...,2018,-0.003313,0.215462
2,0000092122-19-000016.txt,txt hdrsgml access number conform submiss type...,2019,0.031654,0.241299
3,0000092122-18-000027.txt,txt hdrsgml access number conform submiss type...,2018,0.024797,0.241031
4,0000092122-16-000213.txt,txt hdrsgml access number conform submiss type...,2016,0.019257,0.249516
5,0000092122-20-000042.txt,txt hdrsgml access number conform submiss type...,2020,0.018703,0.319803
6,0000092122-17-000065.txt,txt hdrsgml access number conform submiss type...,2017,0.014687,0.239481
7,0000092122-16-000144.txt,txt hdrsgml access number conform submiss type...,2016,0.030789,0.280862
8,0000092122-16-000179.txt,txt hdrsgml access number conform submiss type...,2016,0.024793,0.263715
9,0000092122-19-000053.txt,txt hdrsgml access number conform submiss type...,2019,0.04763,0.290897


# Resources

**Everything Textblob**

Install: https://textblob.readthedocs.io/en/dev/install.html

Tutorial: https://textblob.readthedocs.io/en/dev/classifiers.html#classifiers

Advanced Usage: https://textblob.readthedocs.io/en/dev/advanced_usage.html

**Everything YouTube**

Sentiment Analysis w/ Textblob: https://www.youtube.com/watch?v=N9CT6Ggh0oE

Sentiment Analysis w/ Textblob: https://www.youtube.com/watch?v=bUgKhp8YwO0

**Github**

Alice Zhao Github Notebook: https://github.com/adashofdata/nlp-in-python-tutorial/blob/master/3-Sentiment-Analysis.ipynb