<a href="https://colab.research.google.com/github/dylandb38/hw4/blob/main/QTM250_HW4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="image/title_pic.png" width=750 style="display:block; margin:auto" />

# **QTM250 Homework 4**: Using Machine Learning APIs
### **Group 7** - *Emily Wei, Chen Gong, Sharon Qian, Ricardo Liu, Dylan Douglas-Brown, Jerry Q*

# Introduction
Sentiment analysis is a method for determining whether a given piece of text is positive, negative, or neutral. To assign sentiment scores to the topics, categories, or entities within a phrase, text analytics combines natural language processing (NLP) and machine learning (ML) approaches.

In this blog post, we show how sentiment analysis can be applied to BBC articles. By summarizing and analyzing the scores that the sentiment analysis API returns, we try to understand how it characterizes the attitude expressed in different categories of articles. By the definition of the sentiment score, the higher the score, the more positive the text, and a score below 0 indicates a negative attitude. We use this tool to analyze the attitudes expressed in different categories of articles and ask: *How well does this API do its job?*
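
To make the score interpretation above concrete, here is a minimal sketch that maps a score to a coarse label. The function name `label_sentiment` and the `neutral_band` cutoff of 0.1 are illustrative assumptions, not values defined by the API.

```python
def label_sentiment(score, neutral_band=0.1):
    """Map a sentiment score in [-1.0, 1.0] to a coarse label.

    neutral_band is an assumed cutoff, chosen for illustration only.
    """
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


# Try a few example scores
for s in (0.3, 0.0, -0.4):
    print(s, label_sentiment(s))
```

A wider or narrower `neutral_band` changes how many articles count as neutral, so any cutoff should be reported alongside the results.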

In [None]:
import getpass
# Paste your Google Cloud API key at the hidden prompt; do not hard-code it here.
APIKEY = getpass.getpass()

··········


In [None]:
# build() creates a client for a Google API discovery service
from googleapiclient.discovery import build

# Methods Walkthrough:

### **Importing data from GitHub repository**

We began by importing the necessary packages and the public dataset we identified.

*GitHub repo containing the public dataset:* https://github.com/cgong99/qtm250-ML

*GitHub repo containing all of the data and documentation needed:*
https://github.com/dylandb38/hw4

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/cgong99/qtm250-ML/main/bbc-text.csv')
df.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


In [None]:
# Keep the first 1,000 articles and their category labels
texts = list(df['text'])[:1000]
category = list(df['category'])[:1000]

### **Calling the sentiment analysis API**
The code below sends each article to the API and records the sentiment score, polarity, and magnitude it returns.

In [None]:
# Build a client for the Natural Language API
lservice = build('language', 'v1beta1', developerKey=APIKEY)
magnitude_list = []
polarity_list = []
score_list = []
for text in texts:
  # Send one article per request as plain text
  response = lservice.documents().analyzeSentiment(
    body={
      'document': {
        'type': 'PLAIN_TEXT',
        'content': text
      }
    }).execute()
  # Collect the document-level sentiment values
  sentiment = response['documentSentiment']
  magnitude_list.append(sentiment['magnitude'])
  polarity_list.append(sentiment['polarity'])
  score_list.append(sentiment['score'])
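
A loop of this size can be interrupted by a single failed request. The sketch below shows one way to keep going and record a placeholder instead. The helper `score_texts` and the stand-in classes `FakeService` and `FakeRequest` are hypothetical names introduced for illustration. The fake classes only mimic the call shape used above so that the example runs without an API key.

```python
class FakeRequest:
    """Stand-in for a request object; not part of googleapiclient."""

    def __init__(self, body):
        self._body = body

    def execute(self):
        # Pretend that empty documents are rejected
        if not self._body['document']['content']:
            raise ValueError("empty document")
        return {'documentSentiment': {'score': 0.1, 'magnitude': 0.5}}


class FakeService:
    """Stand-in that mimics service.documents().analyzeSentiment(body=...)."""

    def documents(self):
        return self

    def analyzeSentiment(self, body):
        return FakeRequest(body)


def score_texts(service, texts):
    """Return (score, magnitude) pairs, with (None, None) for failed requests."""
    results = []
    for text in texts:
        body = {'document': {'type': 'PLAIN_TEXT', 'content': text}}
        try:
            response = service.documents().analyzeSentiment(body=body).execute()
            sentiment = response['documentSentiment']
            results.append((sentiment['score'], sentiment['magnitude']))
        except Exception as err:  # deliberately broad: this is only a sketch
            print(f"request failed: {err}")
            results.append((None, None))
    return results


print(score_texts(FakeService(), ["some article text", ""]))
```

Passing the real `lservice` client in place of `FakeService()` should follow the same path, assuming the response keeps the `documentSentiment` structure used in the loop above.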

In [None]:
score_df = pd.DataFrame(list(zip(category,score_list)), columns=['category','score'])
score_df

Unnamed: 0,category,score
0,tech,0.0
1,business,-0.4
2,sport,-0.3
3,sport,0.0
4,entertainment,0.1
...,...,...
995,entertainment,0.2
996,sport,0.3
997,entertainment,-0.6
998,sport,0.1


In [None]:
from google.colab import drive
from google.colab import files
drive.mount('drive')
score_df.to_csv("bbc_sentiment.csv")
!cp bbc_sentiment.csv "drive/My Drive/"
files.download('bbc_sentiment.csv')

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).



# Graphs and Results

### Link to Google Spreadsheet: 
https://docs.google.com/spreadsheets/d/1c7dpUDSjrVUFzt4x1yV_8MACkkPkvo1-gSvHox8sTJg/edit?usp=sharing

<img src="image/pic1.png" width=600/>

<img src="image/pic2.png" width=800/>

## Polarities Distribution Among Five Groups
Using the analyzeSentiment service, we also obtained the polarities and magnitudes for each of the five groups. The figure below shows the distribution of polarity scores across the five groups, from tech to politics. There is a clear pattern: text samples categorized as business or politics tend to have higher polarity scores.

Links to the Google spreadsheets:

https://docs.google.com/spreadsheets/d/16mLkpOIi24XIFMH3JONr8vCGkx7EE7lXl94-jC_8U3A/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1FDtQsDKEcASLRoNvS7tJEnRaOJzMufnqMtiywREDGAw/edit?usp=sharing


<img src="image/Polarities.png" width=800/>
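
The per-group comparison could also be computed directly in pandas rather than in a spreadsheet. The following is a minimal sketch under that assumption. `toy_df` holds made-up values for illustration only, not the results reported here, and would be replaced by a frame shaped like `score_df`.

```python
import pandas as pd

# Toy values for illustration only; not the results reported in this post
toy_df = pd.DataFrame({
    'category': ['tech', 'tech', 'business', 'business', 'sport'],
    'score': [0.0, 0.2, -0.4, 0.4, -0.3],
})

# One row per category with the article count, mean, and median score
summary = toy_df.groupby('category')['score'].agg(['count', 'mean', 'median'])
print(summary)
```

Reporting the count next to the mean helps show when a category's average rests on only a few articles.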

## Magnitudes Distribution Among Five Groups
For magnitudes, text samples categorized as entertainment or politics tend to have higher values. This suggests that these two groups contain more emotional content.

<img src="image/pic3.png" width=800/>


The architecture of our project is illustrated in the diagram above. We first obtained our BBC data from a GitHub repo and loaded it into a Colab notebook. We then preprocessed the dataset in Colab and called the Google Natural Language API. The results were exported to a CSV file and stored in cloud storage for later analysis and visualization using Google Sheets and Python.

# Discussion

The Sentiment Analysis API seems to provide valence ratings of greater magnitude for certain content categories. In other words, the generally emotional nature of the economic and political text blurbs fed into the API might be driving the magnitude of its ratings, rather than the ratings accurately reflecting the degree of emotionality expressed by the writer.

For this reason, it might be difficult to use the Sentiment Analysis API to compare emotionality *across* content areas: a political text might need fewer "charged" words than, for example, a general news blurb to receive the same positive sentiment score.
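
One possible way to explore this concern, which we did not apply in our analysis, is to standardize scores within each category so that an article is compared against its own category's baseline. The sketch below assumes a frame with `category` and `score` columns. `toy_df` and the `score_z` column are hypothetical, and the values are made up for illustration.

```python
import pandas as pd

# Toy values for illustration only; not the results reported in this post
toy_df = pd.DataFrame({
    'category': ['politics', 'politics', 'politics', 'sport', 'sport', 'sport'],
    'score': [0.2, 0.4, 0.6, -0.1, 0.0, 0.1],
})

# Subtract each category's mean and divide by its sample standard deviation
grouped = toy_df.groupby('category')['score']
toy_df['score_z'] = (
    (toy_df['score'] - grouped.transform('mean')) / grouped.transform('std')
)
print(toy_df)
```

A category with a single article has no sample standard deviation, so its `score_z` would be NaN. This adjustment only removes category-level offsets and does not show whether the API's ratings are accurate.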

Our analysis has encouraged us to ask further questions about how manipulating the content of "neutral" words might influence the way the API accounts for and weights "emotional" words in its analysis. 