
# Project 5 - NLP

## Problem Definition


The objective of this project is to choose a famous person from the data frame below and find their ten nearest neighbors through text analysis. This will be accomplished with a bag-of-words representation followed by a TF-IDF transformation.

## Data Collection/Sources


Imports

In [None]:
import numpy as np
import pandas as pd
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neighbors import NearestNeighbors

## Part 1

Installing TextBlob and downloading its corpora.

In [None]:
%%capture
# Install textblob
!pip install -U textblob

In [None]:
%%capture
# Download corpora
!python -m textblob.download_corpora


Reading in the data.

In [None]:
url = 'https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv'
df_original = pd.read_csv(url)

Getting the shape of the data.

In [None]:
df_original.shape

(42786, 3)

## Sentiment Analysis

Using a lambda function to create a new column, Text_Blob_Text, that converts each person's summary text into a TextBlob object. This is required for sentiment analysis.

In [None]:
df_original['Text_Blob_Text'] = df_original['text'].apply(lambda x: TextBlob(x))

Below is the sentiment of person 55, Mary Goldring.

In [None]:
df_original['Text_Blob_Text'][55].sentiment

Sentiment(polarity=-0.10098039215686273, subjectivity=0.2966817496229261)

In [None]:
df_original.loc[55]

Unnamed: 0,55
URI,<http://dbpedia.org/resource/Mary_Goldring>
name,Mary Goldring
text,mary goldring obe is a british business journa...
Text_Blob_Text,"(m, a, r, y, , g, o, l, d, r, i, n, g, , o, ..."


## Pattern Mining

### Vectorizing / BoW

Creating a bag of words in order to fit a TF-IDF model.

In [None]:
vectorizer = CountVectorizer(stop_words='english')
bow_vec = vectorizer.fit_transform(df_original['text'])
bow_vec

<42786x437190 sparse matrix of type '<class 'numpy.int64'>'
	with 5847547 stored elements in Compressed Sparse Row format>
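To make the shape above concrete, here is a minimal sketch on a two-sentence toy corpus. The sentences are made up for illustration and are not taken from the data set. Each row is a document, each column is a vocabulary word, and English stop words are dropped before counting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from NLP.csv
toy_docs = [
    "Mary is a journalist",
    "The journalist is a broadcaster",
]

toy_vectorizer = CountVectorizer(stop_words='english')
toy_bow = toy_vectorizer.fit_transform(toy_docs)

# List the vocabulary words in column order
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_counts = toy_bow.toarray()

print(toy_vocab)
print(toy_counts)
```

The real matrix works the same way, only with 42786 rows and 437190 columns, which is why it is stored in sparse form.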

### TF-IDF

Transforming the bag of words vector to TF-IDF in order to perform a nearest neighbors comparison on the words.

In [None]:
tf_idf_vec = TfidfTransformer()
tf_idf_fit = tf_idf_vec.fit_transform(bow_vec)
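One detail that matters for the nearest-neighbors step: with its default settings, `TfidfTransformer` rescales every row to unit Euclidean length (`norm='l2'`). The sketch below checks this on a made-up toy corpus; the sentences are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, not taken from NLP.csv
toy_docs = [
    "british business journalist and broadcaster",
    "american business editor",
    "british radio broadcaster",
]

toy_bow = CountVectorizer(stop_words='english').fit_transform(toy_docs)
toy_tfidf = TfidfTransformer().fit_transform(toy_bow)

# Euclidean length of each row; the default norm='l2' scales these to 1
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(row_norms)
```

Because every row has the same length, Euclidean distance between rows depends only on how much weighted vocabulary two texts share.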


### Nearest Neighbors

Fitting the TF-IDF data to the nearest neighbors model.

In [None]:
nn = NearestNeighbors().fit(tf_idf_fit)

### Inputting desired person. Requires index.

In [None]:
df_original[df_original['name'] == 'Mary Goldring']

Unnamed: 0,URI,name,text,Text_Blob_Text
55,<http://dbpedia.org/resource/Mary_Goldring>,Mary Goldring,mary goldring obe is a british business journa...,"(m, a, r, y, , g, o, l, d, r, i, n, g, , o, ..."


In the case below, person 55 was chosen at random and used as the reference.

In [None]:
sent0 = tf_idf_fit[55]
sent0.shape

(1, 437190)

In [None]:
distances, indices = nn.kneighbors(
  X = sent0,
  n_neighbors = 11,
)

In [None]:
distances

array([[0.        , 1.31359919, 1.33099633, 1.33208976, 1.3334915 ,
        1.33508681, 1.3369001 , 1.33863531, 1.3397137 , 1.34067823,
        1.34154   ]])

In [None]:
indices

array([[   55, 16046, 28857,   816, 31856, 25199, 10957, 27272, 15701,
        32763, 42651]])
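Index 55 comes back first with distance 0 because the query row is itself part of the fitted data, which is why 11 neighbors are requested in order to get ten other people. A minimal sketch with made-up 2-D points (stand-ins, not the TF-IDF rows) shows the same behavior:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical points standing in for TF-IDF rows
points = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 3.0],
])

toy_nn = NearestNeighbors().fit(points)

# Query with row 0 of the fitted data and ask for 3 neighbors
toy_distances, toy_indices = toy_nn.kneighbors(points[[0]], n_neighbors=3)

# The first hit is the query itself, so drop it to keep the other rows
neighbor_ids = toy_indices[0][1:]
print(toy_indices, toy_distances, neighbor_ids)
```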

In [None]:
transformed_data_frame = pd.DataFrame(indices)

In [None]:
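# Ask for every row so the full distance ranking is available
d_i = nn.kneighbors(sent0, n_neighbors = df_original.shape[0])
# Note: np.array stacks both results into one float array, so the indices become floats
distances, indices = np.array(d_i)
distances**2, indices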


(array([[0.        , 1.72554282, 1.77155124, ..., 2.        , 2.        ,
         2.        ]]),
 array([[   55., 16046., 28857., ...,  8456., 29318., 30817.]]))
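The squared distances above never exceed 2. That follows from the unit-length rows that `TfidfTransformer` produces by default: for unit vectors a and b, the squared distance equals 2 - 2(a·b), and with non-negative entries the dot product lies between 0 and 1. A value of exactly 2 therefore means the two texts share no vocabulary. A small numeric check, using made-up vectors rather than the real matrix:

```python
import numpy as np

def unit(v):
    """Scale a vector to Euclidean length 1."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return float(np.sum((u - v) ** 2))

# Hypothetical word-weight vectors over a four-word vocabulary
a = unit([1, 2, 0, 0])
b = unit([2, 1, 0, 0])   # shares both of its words with a
c = unit([0, 0, 3, 1])   # shares no words with a

cosine_ab = float(a @ b)

print(sq_dist(a, b), 2 - 2 * cosine_ab)  # the two values agree
print(sq_dist(a, c))                     # no overlap gives the maximum
```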

In [None]:
indices = indices.flatten()  # Convert 2D array to 1D
distances = distances.flatten() # Convert 2D distances to 1D

new_df = (
  df_original
    .iloc[indices]
    .join( pd.DataFrame( { "distances^2": distances**2 }, index = indices ) )
)



In [None]:
new_df = new_df.reset_index()

In [None]:
new_df[['index', 'name', 'distances^2']].head(11)

Unnamed: 0,index,name,distances^2
0,55.0,Mary Goldring,0.0
1,16046.0,Andreas Whittam Smith,1.725543
2,28857.0,Ceri Thomas,1.771551
3,816.0,Bruce Reynolds (TV personality),1.774463
4,31856.0,Bill Hagerty,1.7782
5,25199.0,Mike Embley,1.782457
6,10957.0,Terri Thompson,1.787302
7,27272.0,Russell Davies,1.791944
8,15701.0,Brian MacArthur,1.794833
9,32763.0,David Stafford,1.797418
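The index column prints as 55.0 rather than 55 because `np.array(d_i)` stacks the float distances and the integer indices into a single array, and NumPy upcasts everything to float. The sketch below reproduces the effect with made-up values and shows one way to avoid it (plain tuple unpacking), assuming nothing else in the notebook relies on the float form.

```python
import numpy as np

# Stand-ins for the (distances, indices) pair returned by kneighbors
toy_result = (
    np.array([[0.0, 1.3, 1.4]]),
    np.array([[55, 16046, 28857]]),
)

# Stacking into one array forces a common dtype: float64
stacked_distances, stacked_indices = np.array(toy_result)

# Unpacking the tuple directly keeps each array's own dtype
plain_distances, plain_indices = toy_result

print(stacked_indices.dtype, plain_indices.dtype)
```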


## Part 2

Imports again. For some reason these do not work when I put them at the top of my notebook.

In [None]:
%%capture output
#install Wikipedia API
!pip3 install wikipedia-api
import wikipediaapi

In [None]:
!python -m textblob.download_corpora
!pip install -U textblob
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
df_original.iloc[55]['URI']

'<http://dbpedia.org/resource/Mary_Goldring>'

Checking below to see if Mary Goldring's wiki page is loading properly.

In [None]:
topic = 'Mary_Goldring'
wikip = wikipediaapi.Wikipedia(user_agent = 'foobar')
page_ex = wikip.page(topic)
wiki_text = page_ex.text
wiki_text

"Mary Sheila Goldring  (born 1923 - died 2016) was a British business journalist and broadcaster.\nAn economist who graduated from Lady Margaret Hall, Oxford University, Goldring turned to journalism in the late 1940s and became a member of staff at The Economist, where for a long time she was its Business Editor, rising to the rank of Deputy Editor alongside Norman McRae. She left the paper suddenly in spring 1974 following a dispute over its editorship in the wake of the surprise departure of Alastair Burnet, who left to become editor of the Daily Express.\nGoldring then moved to the BBC and meantime also wrote a weekly column for the Investors Chronicle, edited at the time by Andreas Whittam Smith. In 1976 she became one of the main regular presenters of BBC Radio 4's Analysis series of analytical authored current-affairs documentaries. She developed it into a flagship programme, staying with it until 1987. She also made five series of television documentaries, the Goldring Audit, f

In [None]:
TextBlob(wiki_text)

TextBlob("Mary Sheila Goldring  (born 1923 - died 2016) was a British business journalist and broadcaster.
An economist who graduated from Lady Margaret Hall, Oxford University, Goldring turned to journalism in the late 1940s and became a member of staff at The Economist, where for a long time she was its Business Editor, rising to the rank of Deputy Editor alongside Norman McRae. She left the paper suddenly in spring 1974 following a dispute over its editorship in the wake of the surprise departure of Alastair Burnet, who left to become editor of the Daily Express.
Goldring then moved to the BBC and meantime also wrote a weekly column for the Investors Chronicle, edited at the time by Andreas Whittam Smith. In 1976 she became one of the main regular presenters of BBC Radio 4's Analysis series of analytical authored current-affairs documentaries. She developed it into a flagship programme, staying with it until 1987. She also made five series of television documentaries, the Goldring A

In [None]:
Mary_Goldring_Sentiment = TextBlob(wiki_text).sentiment
Mary_Goldring_Sentiment

Sentiment(polarity=-0.07990476190476191, subjectivity=0.3018388278388278)

### With 10 closest nearest neighbors


Printing out a dataframe top_ten_nn to see who the people closest to Mary Goldring are.

In [None]:
top_ten_nn = new_df[['index', 'URI', 'name', 'distances^2']]

In [None]:
top_ten_nn = top_ten_nn.head(12)

### Function to retrieve wiki text.

Below is a function that gets the entire text of a person's Wikipedia page. It uses code similar to the cell above, wrapped in a function so it can be applied to the people in the top_ten_nn dataframe.

In [None]:
def get_wiki_text(person_name):
  """
  Get the full text of a person's Wikipedia page.

  Args:
    person_name: The person's name (page title) as a string.
  Returns: The page text, or None if the page does not exist.
  """
  wikip = wikipediaapi.Wikipedia(user_agent = 'foobar')
  page = wikip.page(person_name)
  if page.exists():
    return page.text
  else:
    return None

### For loop to pull the texts from the function.

Pulls the text from wikipedia based on the person's name from the top_ten_nn.

In [None]:
top_ten_texts = []

for name in top_ten_nn['name']:
    wiki_text = get_wiki_text(name)
    if wiki_text:
        top_ten_texts.append(wiki_text)
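Because the loop skips people whose page is missing, the positions in `top_ten_texts` can drift out of step with the rows of `top_ten_nn`. One hedged alternative is to keep each text next to its name. In the sketch below, `fake_fetch` is a hypothetical stand-in for `get_wiki_text` so the example runs without network access, and the names and texts are invented.

```python
def fake_fetch(person_name):
    """Hypothetical stand-in for get_wiki_text: returns text or None."""
    pages = {
        "Person A": "Text about person A.",
        "Person C": "Text about person C.",
    }
    return pages.get(person_name)

names = ["Person A", "Person B", "Person C"]

# Keep (name, text) pairs so a missing page cannot shift later entries
found = [(name, fake_fetch(name)) for name in names]
found = [(name, text) for name, text in found if text]

kept_names = [name for name, _ in found]
kept_texts = [text for _, text in found]
missing = [name for name in names if name not in kept_names]

print(kept_names, missing)
```

With the names stored alongside the texts, a position in the text list can always be mapped back to the right person, and the missing pages are reported instead of silently dropped.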


### Converting List into String Objects

Converting the texts into strings so they can be cleaned below.

In [None]:
top_ten_texts_str = [str(text) for text in top_ten_texts]

## Data Cleaning

### Cleaning the Strings

A for loop is needed because the texts are stored in a list. It removes newlines, possessive endings, and apostrophes that might make the text blobs less readable.

In [None]:
wiki_text_clean = []

for text in top_ten_texts_str:
    cleaned_text = (
        text
        .replace("\n", " ")
        .replace("\'s", "")
        .replace("\'", "")
    )
    wiki_text_clean.append(cleaned_text)
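The same three replacements can be wrapped in a small helper so they are applied identically wherever text needs cleaning. `clean_wiki_text` is a name introduced here for illustration; it is not part of the original notebook, and the sample sentence is made up.

```python
def clean_wiki_text(text):
    """Apply the notebook's three replacements to a single string."""
    return (
        text
        .replace("\n", " ")   # newlines become spaces
        .replace("'s", "")    # drop possessive endings
        .replace("'", "")     # drop any remaining apostrophes
    )

sample = "BBC Radio 4's Analysis\nwasn't new."
cleaned = clean_wiki_text(sample)
print(cleaned)
```

Applied to a list, this becomes a one-line comprehension such as `[clean_wiki_text(t) for t in texts]`.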

### Converting Cleaned List Into Text Blobs

Converting the clean text list into text blobs so I can perform a sentiment analysis on whichever one I want.

In [None]:
wiki_clean_text_textblobs = []

for text in wiki_text_clean:
  wiki_clean_text_blobs = TextBlob(text)
  wiki_clean_text_textblobs.append(wiki_clean_text_blobs)

## Sentiment Analysis of Person 55

Below is the sentiment analysis of person 55 based on their entire wikipedia article.

In [None]:
wiki_clean_text_textblobs[0].sentiment

Sentiment(polarity=-0.07990476190476191, subjectivity=0.3018388278388278)

## Pattern Mining

### BoW

Vectorizing and bag of words below. Will be used in a TF-IDF.

In [None]:
vectorizer = CountVectorizer(stop_words='english')
bow_matrix = vectorizer.fit_transform(wiki_text_clean)

### TF-IDF

Fitting the bow matrix into the TF-IDF model.

In [None]:
tf_idf_matrix = TfidfTransformer()
tf_idf_wiki = tf_idf_matrix.fit_transform(bow_matrix)

### Nearest Neighbors

Performing a nearest neighbors search on the full text from Wikipedia. The reference corresponds to person 55, Mary Goldring, just as it did in part one.

In [None]:
reference = tf_idf_wiki[0]

In [None]:
nn = NearestNeighbors().fit(tf_idf_wiki)


In [None]:
distances, indices = nn.kneighbors(
  X = reference,
  n_neighbors = 11,
)


In [None]:
distances

array([[0.        , 1.26534466, 1.29569984, 1.29587619, 1.34130996,
        1.34626924, 1.35217868, 1.35229268, 1.36059251, 1.40004308,
        1.40737747]])

In [None]:
indices

array([[ 0,  2,  4,  6,  1,  5, 10,  7,  9,  3,  8]])

### New DF

Creating a new data frame called final_df to show the difference between the original nearest neighbors and the one based on all the text.

In [None]:
# Raw distances go in first; the column is squared a few cells below to match its name
distance_df = pd.DataFrame({'indices': indices.flatten(), 'distances^2': distances.flatten()})


In [None]:
final_df = pd.merge(distance_df, top_ten_nn[['index', 'name']], left_on='indices', right_index=True, how='left')
final_df = final_df[['name', 'indices', 'distances^2']]

In [None]:
final_df['distances^2'] = final_df['distances^2']**2

In [None]:
final_df

Unnamed: 0,name,indices,distances^2
0,Mary Goldring,0,0.0
1,Ceri Thomas,2,1.601097
2,Bill Hagerty,4,1.678838
3,Terri Thompson,6,1.679295
4,Andreas Whittam Smith,1,1.799112
5,Mike Embley,5,1.812441
6,Brian Walpole,10,1.828387
7,Russell Davies,7,1.828695
8,David Stafford,9,1.851212
9,Bruce Reynolds (TV personality),3,1.960121


Old data frame to compare it to.

In [None]:
top_ten_nn = top_ten_nn.drop(['URI'], axis = 1)

In [None]:
top_ten_nn

Unnamed: 0,index,name,distances^2
0,55.0,Mary Goldring,0.0
1,16046.0,Andreas Whittam Smith,1.725543
2,28857.0,Ceri Thomas,1.771551
3,816.0,Bruce Reynolds (TV personality),1.774463
4,31856.0,Bill Hagerty,1.7782
5,25199.0,Mike Embley,1.782457
6,10957.0,Terri Thompson,1.787302
7,27272.0,Russell Davies,1.791944
8,15701.0,Brian MacArthur,1.794833
9,32763.0,David Stafford,1.797418
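To make the comparison between the two tables explicit, the sketch below lines up each person's rank in both. It uses only the nine neighbors visible in each printed table, typed in by hand from the outputs above, so names that appear in only one table are reported separately. Positive values mean the person moved closer to the reference in part 2.

```python
# Neighbor order as printed in the part 1 table (reference row excluded)
part1_order = [
    "Andreas Whittam Smith", "Ceri Thomas", "Bruce Reynolds (TV personality)",
    "Bill Hagerty", "Mike Embley", "Terri Thompson", "Russell Davies",
    "Brian MacArthur", "David Stafford",
]
# Neighbor order as printed in the part 2 table (reference row excluded)
part2_order = [
    "Ceri Thomas", "Bill Hagerty", "Terri Thompson", "Andreas Whittam Smith",
    "Mike Embley", "Brian Walpole", "Russell Davies", "David Stafford",
    "Bruce Reynolds (TV personality)",
]

rank1 = {name: i for i, name in enumerate(part1_order, start=1)}
rank2 = {name: i for i, name in enumerate(part2_order, start=1)}

# Rank change for people present in both tables
rank_change = {
    name: rank1[name] - rank2[name]
    for name in part1_order
    if name in rank2
}
only_in_one = sorted(set(part1_order) ^ set(part2_order))

for name, change in rank_change.items():
    print(f"{name}: {rank1[name]} -> {rank2[name]} ({change:+d})")
print("In one table only:", only_in_one)
```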


## Conclusion

**People closest to 55 based on part 1**:

1. Andreas Whittam Smith
2. Ceri Thomas
3. Bruce Reynolds (TV personality)
4. Bill Hagerty
5. Mike Embley
6. Terri Thompson
7. Russell Davies
8. Brian MacArthur
9. David Stafford
10. Brian Walpole

**Sentiment Analysis of Mary Goldring (person 55), part 1:**

Sentiment(polarity=-0.10098039215686273, subjectivity=0.2966817496229261)



**People closest to 55 based on part 2**:

1. Ceri Thomas
2. Bill Hagerty
3. Terri Thompson
4. Andreas Whittam Smith
5. Mike Embley
6. Brian Walpole
7. Russell Davies
8. David Stafford
9. Bruce Reynolds (TV personality)
10. Brian MacArthur

**Sentiment Analysis of Mary Goldring (person 55), part 2:**

Sentiment(polarity=-0.07990476190476191, subjectivity=0.3018388278388278)

In conclusion, running BoW and TF-IDF on the summary text in part 1 and on the entire wiki article in part 2 produced different rankings of who is "closest" to our reference. Part 2 only re-ranks the candidates found in part 1, so the people are the same but their order changes. This makes sense because the full articles contain far more text, so more words contribute to the comparison. The sentiment analysis also differed slightly: polarity moved toward neutral (from about -0.101 to about -0.080) and subjectivity rose marginally (from about 0.297 to about 0.302). These differences are small, which suggests the summary is a good synopsis of the full wiki text.
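The size of the sentiment shift can be checked directly from the two results reported above. This is plain arithmetic on the printed values; the `Sentiment` tuple here is a local stand-in defined for the example, not TextBlob's own class.

```python
from collections import namedtuple

# Local stand-in for the printed results, not TextBlob's class
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

summary = Sentiment(polarity=-0.10098039215686273, subjectivity=0.2966817496229261)
full_article = Sentiment(polarity=-0.07990476190476191, subjectivity=0.3018388278388278)

# Positive polarity change means the score moved up, toward neutral here
polarity_change = full_article.polarity - summary.polarity
subjectivity_change = full_article.subjectivity - summary.subjectivity

print(f"polarity change:     {polarity_change:+.4f}")
print(f"subjectivity change: {subjectivity_change:+.4f}")
```

Both values are still negative in polarity, so the full article reads as slightly less negative than the summary rather than positive.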