In [105]:
%%html
<style>
table {float:left}
</style>

# Overview

As part of the Capstone project, data downloaded from sources such as Kaggle and fast.ai, along with streaming data collected from the Twitter API using Python, was analyzed.  
The sentiment data collected from the Amazon reviews and the Twitter airline reviews was modified to add a new sentiment field with the values 0 (negative), 2 (neutral) or 4 (positive).  

## Analyze collected tweet data
The Capstone project involves analyzing tweets to determine their sentiment (positive, negative or neutral) and to determine the author's personality type based on the Myers Briggs personality test.  
The data on which the sentiment and personality predictions will be made was downloaded using the tweepy API.  
The analysis of this data is in the [tweepy](#tweepy) section.

## Train/Test Dataset Analysis
Labelled data used for training and testing the models was taken from the sources listed below.

Dataset Description |Link to Analysis| 
 :--- | :---:|
1) Amazon review full score dataset|[amazon_reviews](#amazon_reviews)|
2) Airline Tweet sentiment Analysis |[airlines](#airlines)|
3) Twitter data for sentiment analysis|[sentiment](#sentiment)|
4) Myers Briggs Personality analysis|[MyersBriggs](#MyersBriggs)|

<a id='tweepy'></a>
## Collecting Tweets from Twitter (tweepy)
### Source:
A Python script using the tweepy library was developed and used to collect a million tweets from Twitter. The script saves the tweets as JSON files, each file holding 100,000 tweets.
The number of tweets to store per file and the total number of tweets to collect can be passed as command-line arguments.
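
The batching behaviour described above can be sketched roughly as follows. This is not the project's collection script: the function names, the `--per-file`, `--total` and `--out-dir` options, and the output file naming are all assumptions made for illustration, and the tweets are taken from any iterable of dictionaries rather than from a live tweepy stream.

```python
import argparse
import json
from pathlib import Path


def write_batches(tweets, out_dir, per_file, total):
    """Write up to `total` tweets into JSON files holding `per_file` tweets each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths, batch = [], []
    written = 0

    def flush():
        # Hypothetical naming scheme: tweets_000.json, tweets_001.json, ...
        path = out_dir / f"tweets_{len(paths):03d}.json"
        path.write_text(json.dumps(batch), encoding="utf-8")
        paths.append(path)
        batch.clear()

    for tweet in tweets:
        if written >= total:
            break
        batch.append(tweet)
        written += 1
        if len(batch) == per_file:
            flush()
    if batch:  # write the final, partially filled file
        flush()
    return paths


def parse_args(argv=None):
    """Read the per-file and total tweet counts from the command line."""
    parser = argparse.ArgumentParser(description="Collect tweets into JSON files")
    parser.add_argument("--per-file", type=int, default=100_000)
    parser.add_argument("--total", type=int, default=1_000_000)
    parser.add_argument("--out-dir", default="tweets")
    return parser.parse_args(argv)
```

With the defaults shown, one million tweets would be split across ten files, which matches the counts reported for this collection run.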

### Description

Ten JSON files were created, with 100,000 tweets each.  
The tweets were not cleaned: emoticons and hyperlinks were not removed. The tweet text and the full text of the extended tweet were stored as separate fields.


### Metadata
The fields in the JSON files and the status object fields they map to are as follows:

Field Name | Status object field | Description |
 :- |---| :-|
created_at | created_at | The time the status was posted, in the UTC timezone |
text | text | The text of the tweet, up to 140 characters |
extended_tweet_text | extended_tweet.full_text | Full text of the tweet, beyond the 140-character limit |
source | source | The source of the tweet |
user | user.id | The user ID of the poster of the tweet |
followers_count | user.followers_count | Count of followers for the user |
friends_count | user.friends_count | Count of friends for the user |
geo_enabled | user.geo_enabled | True or False, depending on whether the user has geo-tagging enabled |
time_zone | time_zone | Time zone of the tweet |
geo | geo | The geo object of the tweet |
coordinates | coordinates | The coordinates of the tweet |
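
The mapping in the table can be read as a small flattening step. The sketch below assumes the status is available as a plain dictionary (for example, a parsed JSON payload); `flatten_status` is a hypothetical helper name, not the function used in the collection script.

```python
def flatten_status(status):
    """Map a tweet status dictionary to the flat record described in the table."""
    user = status.get("user") or {}
    extended = status.get("extended_tweet") or {}
    return {
        "created_at": status.get("created_at"),
        "text": status.get("text"),
        # Only present when the tweet has an extended form
        "extended_tweet_text": extended.get("full_text"),
        "source": status.get("source"),
        "user": user.get("id"),
        "followers_count": user.get("followers_count"),
        "friends_count": user.get("friends_count"),
        "geo_enabled": user.get("geo_enabled"),
        "time_zone": status.get("time_zone"),
        "geo": status.get("geo"),
        "coordinates": status.get("coordinates"),
    }
```

Missing fields come back as `None`, which is consistent with the mostly empty `extended_tweet_text`, `geo` and `coordinates` columns in the loaded data.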



In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [13]:
tweets_file="..\\datasets\\tweets\\tweets_stream_20211220154418725171.json"

In [14]:
tweets_df=pd.read_json(tweets_file)

In [57]:
tweets_df

Unnamed: 0,created_at,text,extended_tweet_text,source,user,followers_count,friends_count,geo_enabled,time_zone,geo,coordinates
0,2021-12-20 20:10:53+00:00,"@MultiversePad Amazing project, letsgo to the ...",,"<a href=""http://twitter.com/download/iphone"" r...",1153690751777198082,79,247,True,,,
1,2021-12-20 20:10:53+00:00,RT @StarMusicPH: The official #KundimanDanceCh...,,"<a href=""http://twitter.com/download/android"" ...",1445940058683437061,39,168,False,,,
2,2021-12-20 20:10:53+00:00,@AccountableGOP @laurenboebert So she admits t...,,"<a href=""http://twitter.com/download/iphone"" r...",105547319,310,91,True,,,
3,2021-12-20 20:10:53+00:00,and its gonna be mine,,"<a href=""http://twitter.com/download/iphone"" r...",3005466034,1091,954,False,,,
4,2021-12-20 20:10:53+00:00,Get me lit! 🔥 https://t.co/AWVJdbdaMh,,"<a href=""http://twitter.com/download/iphone"" r...",1311029975114493954,1323,788,True,,,
...,...,...,...,...,...,...,...,...,...,...,...
99995,2021-12-20 20:44:11+00:00,your in your 20s on stan twitter it’s clear yo...,your in your 20s on stan twitter it’s clear yo...,"<a href=""http://twitter.com/download/iphone"" r...",1344100668366327814,719,826,False,,,
99996,2021-12-20 20:44:11+00:00,"RT @BostonGlobe: Since 2020, judges released a...",,"<a href=""https://mobile.twitter.com"" rel=""nofo...",394419056,166,879,False,,,
99997,2021-12-20 20:44:11+00:00,RT @ellisonbg: 1) Today (12/18/2021 GMT) marks...,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",120372396,7012,6423,False,,,
99998,2021-12-20 20:44:11+00:00,@yggsea Sometimes you forgot or didn't know yo...,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",613475375,71,93,True,,,


<a id='amazon_reviews'></a>
## Amazon Review Full Score Dataset

### Source: https://course.fast.ai/datasets  
The Amazon reviews dataset consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. For more information, please refer to the following paper: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.  
The Amazon reviews full score dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the above dataset. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).  

### Description

The Amazon reviews full score dataset is constructed by randomly taking 600,000 training samples and 130,000 testing samples for each review score from 1 to 5. In total there are 3,000,000 training samples and 650,000 testing samples.
The files train.csv and test.csv contain all the samples as comma-separated values.

### MetaData
Field Name |Description| 
 :- | :-|
class_index | The rating value from 1 to 5|
review_title| Title of the review. The review title and text are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed with an "n" character, that is "\n".|
review_text| Review text. The review title and text are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed with an "n" character, that is "\n".|
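
The escaping rules above can be illustrated with Python's standard `csv` module, which treats a doubled double quote inside a quoted field as a literal quote by default. The sample row is made up for illustration, and `read_reviews` is a hypothetical helper; the notebook itself loads the file with `pd.read_csv`.

```python
import csv
import io

# One made-up row: a doubled quote in the title, an escaped newline in the text
sample = '"3","Said ""wow""","Line one\\nLine two"\n'


def read_reviews(handle):
    """Parse review rows and turn the escaped "\\n" sequences into real newlines."""
    rows = []
    for class_index, title, text in csv.reader(handle):
        rows.append({
            "class_index": int(class_index),
            "review_title": title.replace("\\n", "\n"),
            "review_text": text.replace("\\n", "\n"),
        })
    return rows


rows = read_reviews(io.StringIO(sample))
```

The newline replacement is a separate, optional step: the CSV parser leaves the backslash and "n" characters in place.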


In [15]:
amazon_review_file="..\\datasets\\amazon_review_full_csv\\amazon_reviews_train.csv"
columns=['class_index','review_title','review_text']

In [17]:
reviews_df=pd.read_csv(amazon_review_file,names=columns)

In [18]:
reviews_df

Unnamed: 0,class_index,review_title,review_text
0,3,more like funchuck,Gave this to my dad for a gag gift after direc...
1,5,Inspiring,I hope a lot of people hear this cd. We need m...
2,5,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
3,4,Chrono Cross OST,The music of Yasunori Misuda is without questi...
4,5,Too good to be true,Probably the greatest soundtrack in history! U...
...,...,...,...
2999995,1,Don't do it!!,The high chair looks great when it first comes...
2999996,2,"Looks nice, low functionality",I have used this highchair for 2 kids now and ...
2999997,2,"compact, but hard to clean","We have a small house, and really wanted two o..."
2999998,3,Hard to clean!,I agree with everyone else who says this chair...


In [20]:
reviews_df['class_index'].value_counts()

5    600000
4    600000
3    600000
2    600000
1    600000
Name: class_index, dtype: int64

### Converting the class index to negative (value 0), neutral (value 2) and positive (value 4)

In [96]:
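# One boolean condition per target value, in the same order as review_values
reviews_conditions=[(reviews_df['class_index'].isin([1,2])),   # ratings 1-2: negative
                    ((reviews_df['class_index']==3)),           # rating 3: neutral
                    (reviews_df['class_index'].isin([4,5]))    # ratings 4-5: positive
                  ]
review_values=[0,2,4]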

In [88]:
reviews_df['sentiment']=np.select(reviews_conditions,review_values)
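
As a cross-check on this mapping, the sketch below restates it in plain Python and applies it to the per-rating counts reported by `value_counts()` (600,000 reviews per rating). The name `rating_to_sentiment` is an assumption made for illustration and does not appear in the notebook.

```python
from collections import Counter


def rating_to_sentiment(rating):
    """Map a 1-5 rating to 0 (negative), 2 (neutral) or 4 (positive)."""
    if rating in (1, 2):
        return 0
    if rating == 3:
        return 2
    if rating in (4, 5):
        return 4
    raise ValueError(f"unexpected rating: {rating!r}")


# Per-rating counts taken from the value_counts() output above
rating_counts = {1: 600_000, 2: 600_000, 3: 600_000, 4: 600_000, 5: 600_000}

sentiment_counts = Counter()
for rating, count in rating_counts.items():
    sentiment_counts[rating_to_sentiment(rating)] += count
```

Under this mapping the balanced five-class data becomes a three-class set in which the neutral class is half the size of the negative and positive classes, which may be worth keeping in mind when training.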

In [89]:
reviews_df

Unnamed: 0,class_index,review_title,review_text,sentiment
0,3,more like funchuck,Gave this to my dad for a gag gift after direc...,2
1,5,Inspiring,I hope a lot of people hear this cd. We need m...,4
2,5,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...,4
3,4,Chrono Cross OST,The music of Yasunori Misuda is without questi...,4
4,5,Too good to be true,Probably the greatest soundtrack in history! U...,4
...,...,...,...,...
2999995,1,Don't do it!!,The high chair looks great when it first comes...,0
2999996,2,"Looks nice, low functionality",I have used this highchair for 2 kids now and ...,0
2999997,2,"compact, but hard to clean","We have a small house, and really wanted two o...",0
2999998,3,Hard to clean!,I agree with everyone else who says this chair...,2


<a id='airlines'></a>
## Airline Tweet sentiment
### Source: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

### Description
This data originally came from Crowdflower's Data for Everyone library.

As the original source says:

> A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").

The data on Kaggle is a slightly reformatted version of the original source.

### MetaData
The .csv file contains the following 15 fields:

Field Name |Description| 
------------|--------------|
tweet_id | Twitter tweet id |
airline_sentiment | Values of 'positive', 'negative' and 'neutral' |
airline_sentiment_confidence | Confidence level of the sentiment |
negativereason | Reason for negative sentiment |
negativereason_confidence | Confidence level for the negative sentiment reason |
airline | Name of the airline |
airline_sentiment_gold | All values NaN. Purpose of the field unclear |
name | Twitter user name |
negativereason_gold | All values NaN. Purpose of the field unclear |
retweet_count | Count of retweets for the tweet |
text | Tweet text |
tweet_coord | Coordinates for the tweet |
tweet_created | Tweet created time |
tweet_location | Tweet location |
user_timezone | Time zone of the user |

In [19]:
tweet_airline="..\\datasets\\twitter_airline_sentiment\\Tweets.csv"

In [20]:
tweet_airline_df=pd.read_csv(tweet_airline)

In [21]:
tweet_airline_df.shape

(14640, 15)

In [22]:
tweet_airline_df.head(10)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
5,570300767074181121,negative,1.0,Can't Tell,0.6842,Virgin America,,jnardino,,0,@VirginAmerica seriously would pay $30 a fligh...,,2015-02-24 11:14:33 -0800,,Pacific Time (US & Canada)
6,570300616901320704,positive,0.6745,,0.0,Virgin America,,cjmcginnis,,0,"@VirginAmerica yes, nearly every time I fly VX...",,2015-02-24 11:13:57 -0800,San Francisco CA,Pacific Time (US & Canada)
7,570300248553349120,neutral,0.634,,,Virgin America,,pilot,,0,@VirginAmerica Really missed a prime opportuni...,,2015-02-24 11:12:29 -0800,Los Angeles,Pacific Time (US & Canada)
8,570299953286942721,positive,0.6559,,,Virgin America,,dhepburn,,0,"@virginamerica Well, I didn't…but NOW I DO! :-D",,2015-02-24 11:11:19 -0800,San Diego,Pacific Time (US & Canada)
9,570295459631263746,positive,1.0,,,Virgin America,,YupitsTate,,0,"@VirginAmerica it was amazing, and arrived an ...",,2015-02-24 10:53:27 -0800,Los Angeles,Eastern Time (US & Canada)


### Converting the airline sentiment label to negative (value 0), neutral (value 2) and positive (value 4)

In [93]:
# The labels must match the values in airline_sentiment exactly;
# rows that match no condition fall back to np.select's default of 0
airline_conditions=[(tweet_airline_df['airline_sentiment']=='negative'),
                    ((tweet_airline_df['airline_sentiment']=='neutral')),
                    (tweet_airline_df['airline_sentiment']=='positive')
                  ]
airline_values=[0,2,4]

In [94]:
tweet_airline_df['sentiment']=np.select(airline_conditions,airline_values)

In [95]:
tweet_airline_df

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,sentiment
0,570306133677760513,neutral,1.0000,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),2
1,570301130888122368,positive,0.3486,,0.0000,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),4
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),2
3,570301031407624196,negative,1.0000,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),0
4,570300817074462722,negative,1.0000,Can't Tell,1.0000,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14635,569587686496825344,positive,0.3487,,0.0000,American,,KristenReenders,,0,@AmericanAir thank you we got on a different f...,,2015-02-22 12:01:01 -0800,,,4
14636,569587371693355008,negative,1.0000,Customer Service Issue,1.0000,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,,0
14637,569587242672398336,neutral,1.0000,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",,2
14638,569587188687634433,negative,1.0000,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada),0
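
Because `np.select` falls back to its default value (0 unless another default is given) for rows that match no condition, a label that is missing or misspelled in the conditions would silently be encoded as negative. One way to guard against that is an explicit lookup that fails on unknown labels. The sketch below is plain Python with hypothetical names (`LABEL_TO_SENTIMENT`, `encode_labels`); it is an alternative for illustration, not the notebook's method.

```python
# Expected string labels and their numeric sentiment codes
LABEL_TO_SENTIMENT = {"negative": 0, "neutral": 2, "positive": 4}


def encode_labels(labels):
    """Convert string labels to 0/2/4, raising if any label is unrecognised."""
    labels = list(labels)
    unknown = sorted(set(labels) - set(LABEL_TO_SENTIMENT))
    if unknown:
        raise ValueError(f"unrecognised sentiment labels: {unknown}")
    return [LABEL_TO_SENTIMENT[label] for label in labels]
```

In pandas, the same dictionary could be passed to `Series.map`, which leaves `NaN` for any label not in the dictionary, so unmapped rows remain visible instead of being merged into the negative class.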


<a id='sentiment'></a>
## Twitter Data for Sentiment

### Source: http://help.sentiment140.com/for-students

### Description
The training data was automatically created, as opposed to having humans manually annotate tweets. In the approach used, any tweet with positive emoticons, like :), was assumed to be positive, and tweets with negative emoticons, like :(, were assumed to be negative. The dataset's authors used the Twitter Search API to collect these tweets by keyword search. This is described in the following [paper](https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf).  
The data is a CSV file with emoticons removed.
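
The labelling idea described above can be sketched as follows. The emoticon lists and the function name `label_by_emoticon` are assumptions chosen for illustration; the dataset's authors may have used different lists and rules.

```python
# Illustrative emoticon lists (assumed, not taken from the dataset itself)
POSITIVE = (":)", ":-)", ": )", ":D", "=)")
NEGATIVE = (":(", ":-(", ": (")


def label_by_emoticon(text):
    """Return (sentiment, cleaned_text), or None if the tweet cannot be labelled."""
    has_positive = any(emoticon in text for emoticon in POSITIVE)
    has_negative = any(emoticon in text for emoticon in NEGATIVE)
    if has_positive == has_negative:
        return None  # no emoticon at all, or mixed signals
    cleaned = text
    for emoticon in POSITIVE + NEGATIVE:
        cleaned = cleaned.replace(emoticon, "")
    # Collapse the whitespace left behind by the removed emoticons
    return (4 if has_positive else 0), " ".join(cleaned.split())
```

A scheme like this only ever produces positive or negative labels, which is consistent with the training file containing just the values 0 and 4.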

### MetaData

Field Name |Description| 
------------|--------------|
sentiment | the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)  |
tweet_id| the id of the tweet |
tweet_date|the date of the tweet in UTC  |
query| the query used. If there is no query, then this value is NO_QUERY.  |
username| the user that tweeted  |
tweet|the text of the tweet  |

In [27]:
twitter_sentiment140="..\\datasets\\twitter_sentiment140\\training_140_sentiment.csv"
column_list_sentiment=['sentiment','tweet_id','tweet_date','query','username','tweet']

In [28]:
setiment140_df=pd.read_csv(twitter_sentiment140,encoding='cp1252',names=column_list_sentiment)

In [29]:
setiment140_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
sentiment     1600000 non-null int64
tweet_id      1600000 non-null int64
tweet_date    1600000 non-null object
query         1600000 non-null object
username      1600000 non-null object
tweet         1600000 non-null object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [31]:
setiment140_df

Unnamed: 0,sentiment,tweet_id,tweet_date,query,username,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [77]:
setiment140_df['sentiment'].value_counts()

4    800000
0    800000
Name: sentiment, dtype: int64

<a id='MyersBriggs'></a>
## Myers Briggs Personality Type

### Source: https://www.kaggle.com/datasnaek/mbti-type

### Description

The Myers Briggs Type Indicator (or MBTI for short) is a personality type system that divides everyone into 16 distinct personality types across 4 axes:

Introversion (I) – Extroversion (E)  
Intuition (N) – Sensing (S)  
Thinking (T) – Feeling (F)  
Judging (J) – Perceiving (P)  

So for example, someone who prefers introversion, intuition, thinking and perceiving would be labelled an INTP in the MBTI system, and there are lots of personality based components that would model or describe this person’s preferences or behaviour based on the label.
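
The relationship between the four axes and the 16 type codes can be made concrete with a short sketch. The names `AXES`, `decode_mbti` and `all_types` are hypothetical and used only for illustration.

```python
from itertools import product

# Each axis: (first letter, second letter, first name, second name)
AXES = (
    ("I", "E", "Introversion", "Extroversion"),
    ("N", "S", "Intuition", "Sensing"),
    ("T", "F", "Thinking", "Feeling"),
    ("J", "P", "Judging", "Perceiving"),
)


def decode_mbti(code):
    """Translate a 4-letter MBTI code into the preference named on each axis."""
    code = code.upper()
    if len(code) != len(AXES):
        raise ValueError(f"expected a 4-letter code, got {code!r}")
    preferences = []
    for letter, (first, second, first_name, second_name) in zip(code, AXES):
        if letter == first:
            preferences.append(first_name)
        elif letter == second:
            preferences.append(second_name)
        else:
            raise ValueError(f"{letter!r} is not valid at this position in {code!r}")
    return preferences


# Two choices on each of four axes give 2**4 = 16 possible type codes
all_types = ["".join(letters) for letters in product(*[(a, b) for a, b, _, _ in AXES])]
```

Treating each axis separately in this way is one option for modelling: it turns a single 16-class problem into four binary ones.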

It is one of, if not the, most popular personality tests in the world. It is used in businesses, online, for fun, for research and lots more. A simple Google search reveals all of the different ways the test has been used over time. It’s safe to say that this test is still very relevant in the world in terms of its use.

From a scientific or psychological perspective, it is based on the work done on cognitive functions by Carl Jung, i.e. Jungian Typology. This was a model of 8 distinct functions, thought processes or ways of thinking that were suggested to be present in the mind. Later this work was transformed into several different personality systems to make it more accessible, the most popular of which is of course the MBTI.

Recently, its use/validity has come into question because of unreliability in experiments surrounding it, among other reasons. But it is still clung to as being a very useful tool in a lot of areas, and the purpose of this dataset is to help see if any patterns can be detected in specific types and their style of writing, which overall explores the validity of the test in analysing, predicting or categorising behaviour.

### Acknowledgements  
This data was collected through the PersonalityCafe forum, as it provides a large selection of people and their MBTI personality type, as well as what they have written.

### MetaData
Field Name |Description| 
------------|--------------|
type | This person's 4-letter MBTI code/type |
posts | A section of each of the last 50 things they have posted (each entry separated by "\|\|\|", 3 pipe characters) |
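
Since each `posts` value packs many entries into one string, a first processing step is usually to split it on the three-pipe separator. The helper below is a minimal sketch with a hypothetical name, `split_posts`; stripping the surrounding apostrophes is an assumption based on how the values appear when the file is loaded.

```python
def split_posts(posts):
    """Split a posts string on "|||" and drop empty entries."""
    # Assumption: values may be wrapped in apostrophes, as in the loaded data
    trimmed = posts.strip().strip("'")
    return [entry.strip() for entry in trimmed.split("|||") if entry.strip()]
```

For example, `split_posts("'first post|||second post'")` would return a list of two strings.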
  


In [32]:
myers_briggs_csv='..\\datasets\\myers_briggs_personality_test\\mbti_1.csv'

In [33]:
myers_briggs_df=pd.read_csv(myers_briggs_csv)

In [34]:
myers_briggs_df['type'].value_counts()

INFP    1832
INFJ    1470
INTP    1304
INTJ    1091
ENTP     685
ENFP     675
ISTP     337
ISFP     271
ENTJ     231
ISTJ     205
ENFJ     190
ISFJ     166
ESTP      89
ESFP      48
ESFJ      42
ESTJ      39
Name: type, dtype: int64
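
The counts above are heavily skewed towards a few types. Summing them per axis gives a simpler view of the imbalance. The sketch below copies the figures from the `value_counts()` output; `axis_counts` is a hypothetical helper name.

```python
# Type counts copied from the value_counts() output above
type_counts = {
    "INFP": 1832, "INFJ": 1470, "INTP": 1304, "INTJ": 1091,
    "ENTP": 685, "ENFP": 675, "ISTP": 337, "ISFP": 271,
    "ENTJ": 231, "ISTJ": 205, "ENFJ": 190, "ISFJ": 166,
    "ESTP": 89, "ESFP": 48, "ESFJ": 42, "ESTJ": 39,
}


def axis_counts(counts, position):
    """Total the rows by the letter at one position (0-3) of the type code."""
    totals = {}
    for code, count in counts.items():
        letter = code[position]
        totals[letter] = totals.get(letter, 0) + count
    return totals


per_axis = [axis_counts(type_counts, position) for position in range(4)]
total_rows = sum(type_counts.values())
```

Each entry of `per_axis` sums to the total number of rows, so the share of each letter on an axis can be read off directly.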

In [35]:
myers_briggs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8675 entries, 0 to 8674
Data columns (total 2 columns):
type     8675 non-null object
posts    8675 non-null object
dtypes: object(2)
memory usage: 135.7+ KB


In [36]:
myers_briggs_df

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...
...,...,...
8670,ISFP,'https://www.youtube.com/watch?v=t8edHB_h908||...
8671,ENFP,'So...if this thread already exists someplace ...
8672,INTP,'So many questions when i do these things. I ...
8673,INFP,'I am very conflicted right now when it comes ...
