# Project: Wrangling and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('twitter-archive-enhanced.csv')
df.head()
# Twitter ids.
ids = df.tweet_id.values
# Visual assessment
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [2]:
import requests

r = requests.get(' https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')
# Code which does not work.
#df_predictions = pd.read_csv(r.text, sep='\t')


# Uglier code which does work.
a = [i.split('\t') for i in r.text.split('\n')]
a = [item for item in a if len(item) == 12]
columns = a[0]
data = np.array(a[1:])
df_predictions = pd.DataFrame(data=data, columns=columns)



# Check for null values.
df_predictions.isnull().values.any()

# Convert id to integer as merge is done on this column. 
df_predictions['tweet_id'] = df_predictions['tweet_id'].astype('int64')
df_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null object
p1          2075 non-null object
p1_conf     2075 non-null object
p1_dog      2075 non-null object
p2          2075 non-null object
p2_conf     2075 non-null object
p2_dog      2075 non-null object
p3          2075 non-null object
p3_conf     2075 non-null object
p3_dog      2075 non-null object
dtypes: int64(1), object(11)
memory usage: 194.6+ KB


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [3]:
import tweepy,json

# I intented to extraxt the data from twitter but they did require eleveted
# account an that takes time. 
# I decided to use the text file provided by Udacity

#ids,favorite_counts, retweet_counts = [],[],[]

f = open("tweet-json.txt", "r")
data = {}
i = 0
for line in f:
    d = json.loads(line)
    for key in d.keys():
        if key in data:
            data[key].append(d[key])
        else:
            data[key] = [d[key]]
    i += 1
print('number of rows in tweet-json', i)
# Select only data in from all rows.

data_reduced = {}
for key in data.keys():
    if len(data[key])==i:
        data_reduced[key] = data[key]


        
df_tweet = pd.DataFrame(data_reduced)
df_tweet.head()

number of rows in tweet-json 2354


Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,source,in_reply_to_status_id,in_reply_to_status_id_str,...,geo,coordinates,place,contributors,is_quote_status,retweet_count,favorite_count,favorited,retweeted,lang
0,Tue Aug 01 16:23:56 +0000 2017,892420643555336193,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,False,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""http://twitter.com/download/iphone"" r...",,,...,,,,,False,8853,39467,False,False,en
1,Tue Aug 01 00:17:27 +0000 2017,892177421306343426,892177421306343426,This is Tilly. She's just checking pup on you....,False,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""http://twitter.com/download/iphone"" r...",,,...,,,,,False,6514,33819,False,False,en
2,Mon Jul 31 00:18:03 +0000 2017,891815181378084864,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,False,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""http://twitter.com/download/iphone"" r...",,,...,,,,,False,4328,25461,False,False,en
3,Sun Jul 30 15:58:51 +0000 2017,891689557279858688,891689557279858688,This is Darla. She commenced a snooze mid meal...,False,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""http://twitter.com/download/iphone"" r...",,,...,,,,,False,8964,42908,False,False,en
4,Sat Jul 29 16:00:24 +0000 2017,891327558926688256,891327558926688256,This is Franklin. He would like you to stop ca...,False,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","<a href=""http://twitter.com/download/iphone"" r...",,,...,,,,,False,9774,41048,False,False,en


## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issue**. You must use **both** visual assessment
programmatic assessement to assess the data.

**Note:** pay attention to the following key points when you access the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



In [4]:
# Merge both dataframes.
df = df.merge(df_tweet, left_on='tweet_id',right_on='id')
df = df.merge(df_predictions, left_on='tweet_id', right_on='tweet_id')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2073 entries, 0 to 2072
Data columns (total 52 columns):
tweet_id                      2073 non-null int64
in_reply_to_status_id_x       23 non-null float64
in_reply_to_user_id_x         23 non-null float64
timestamp                     2073 non-null object
source_x                      2073 non-null object
text                          2073 non-null object
retweeted_status_id           79 non-null float64
retweeted_status_user_id      79 non-null float64
retweeted_status_timestamp    79 non-null object
expanded_urls                 2073 non-null object
rating_numerator              2073 non-null int64
rating_denominator            2073 non-null int64
name                          2073 non-null object
doggo                         2073 non-null object
floofer                       2073 non-null object
pupper                        2073 non-null object
puppo                         2073 non-null object
created_at                    2073 

### Quality issues
#### Merged Table
1. **Completeness issues**. Some columns in the table do have 2356 non-null entries while other have less, down to 78 non-null entries. Some null values are represented as strings instead of NaN. 

2. **Datatype issue**: The *timestamp* and *retweeted_status_timestamp* columns should be of the type datetime instead of object. 

3. **Data Consistency**: The *source_x* column should be categorical and contain Twitter for iPhone, Twitter Web Client, Vine and TweetDeck instead of an anchor tag.

4. **Data Validity**: The *name* column contains names with lowercase letters. Right names can not be determined from the text column.

5. **Datatype issue**: The *doggo*, *floofer*, *pupper* and *puppo* columns should not be of type object. They should be of type boolean. NaN values do not represent missing values. These columsn should also be joined into a single column (tidiness).

6. **Datatype issue**: The *lang* column should be of datatype category.

7. **Datatype consistency**: Some missing values are represented as None instead of NaN. This is inconsistency and NaN was chosen as it works better with numerical values. 

8. **Data Validity**: The *display_text_range* holds record of the length of each text. This is better displayed as a length of the text.

9. **Datatype issue**: Further proper datatype conversions needed.


### Tidiness issues
#### Merged Table

1. **Each variable should form a column**: Remove columns like *id_str* contains repeated or redundant information.

2. **Each variable should form a column**: The *doggo*, *puppy*, *floofer* and *pupper* should be in a single *stage* column with the column names as values.

3. **Each variable should form a column**: Remove columns which contain dictionaries. Each dictonary holds multible values. 

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [5]:
# Make copies of original pieces of data
df_copy = df.copy()
df_copy.head()


Unnamed: 0,tweet_id,in_reply_to_status_id_x,in_reply_to_user_id_x,timestamp,source_x,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,...,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,...,1,orange,0.0970486,False,bagel,0.0858511,False,banana,0.07611,False
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,...,1,Chihuahua,0.323581,True,Pekinese,0.0906465,True,papillon,0.0689569,True
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,...,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.0313789,True
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,...,1,paper_towel,0.1702779999999999,False,Labrador_retriever,0.1680859999999999,True,spatula,0.0408359,False
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,...,2,basset,0.555712,True,English_springer,0.2257699999999999,True,German_short-haired_pointer,0.175219,True


### Issue #1:

#### Define: Replace all None strings with NaN.

#### Code

In [6]:
df_copy.replace('None',np.nan,inplace=True)


#### Test

In [7]:
# These columns: doggo, floofer, pupper, puppo should only contain their 
# column name as a proper value.
print(df_copy['puppo'].unique(), df_copy['pupper'].unique(), df_copy['floofer'].unique(),df_copy['pupper'].unique())


[nan 'puppo'] [nan 'pupper'] [nan 'floofer'] [nan 'pupper']


### Issue #2:

#### Define
The timestamp and retweeted_status_timestamp columns should be of the type datetime instead of object.

#### Code

In [8]:
df_copy['timestamp'] = pd.to_datetime(df_copy['timestamp'])
df_copy['retweeted_status_timestamp'] = pd.to_datetime(df_copy['timestamp'])


#### Test

In [9]:
# timestamp and retweeted_status_timestamp do have datetype datetime.
df_copy.dtypes

tweet_id                               int64
in_reply_to_status_id_x              float64
in_reply_to_user_id_x                float64
timestamp                     datetime64[ns]
source_x                              object
text                                  object
retweeted_status_id                  float64
retweeted_status_user_id             float64
retweeted_status_timestamp    datetime64[ns]
expanded_urls                         object
rating_numerator                       int64
rating_denominator                     int64
name                                  object
doggo                                 object
floofer                               object
pupper                                object
puppo                                 object
created_at                            object
id                                     int64
id_str                                object
full_text                             object
truncated                               bool
display_te

### Issue #3:

#### Define
The *source* column should be categorical and contain Twitter for iPhone, Twitter Web Client, Vine and TweetDeck instead of an anchor tag.

#### Code

In [10]:
df_copy['source_x'].unique()
df_copy['source_x'].replace(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
       '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
       '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
       '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>'], ['Twitter for iphone',
                                                                                                'Twitter Web Client',
                                                                                               'Vine','TweetDeck'],
                          inplace=True)
df_copy['source_x']=df_copy['source_x'].astype('category')

#### Test

In [11]:
df_copy.dtypes
#df_copy.sample(5)

tweet_id                               int64
in_reply_to_status_id_x              float64
in_reply_to_user_id_x                float64
timestamp                     datetime64[ns]
source_x                            category
text                                  object
retweeted_status_id                  float64
retweeted_status_user_id             float64
retweeted_status_timestamp    datetime64[ns]
expanded_urls                         object
rating_numerator                       int64
rating_denominator                     int64
name                                  object
doggo                                 object
floofer                               object
pupper                                object
puppo                                 object
created_at                            object
id                                     int64
id_str                                object
full_text                             object
truncated                               bool
display_te

### Issue #4:

#### Define
The *name* column contains names with lowercase letters. Right names can not be determined from the text column. Replace these words with NaN

#### Code


In [12]:
lower_case_names = list(set(df_copy[(df_copy.name != np.nan) & df_copy.name.str.islower()].name))
print(lower_case_names)
df_copy['name']=df_copy.name.replace(lower_case_names,None)

['quite', 'one', 'infuriating', 'such', 'just', 'not', 'light', 'all', 'officially', 'the', 'getting', 'his', 'space', 'actually', 'unacceptable', 'a', 'incredibly', 'by', 'very', 'this', 'my', 'an']


#### Test

In [13]:
# No row contains a name that starts with lower letters.
df_copy.name.str.islower().fillna(False).unique()

array([False], dtype=bool)

### Issue #5:

#### Define


The *doggo*, *floofer*, *pupper* and *puppo* columns should be of the type boolean instead of object. NaN values do not represent missing values. These columsn should also be joined into a single column (tidiness).

In [14]:
#### Code
stages = ['doggo','floofer','pupper','puppo']
for i in stages:
    df_copy[i] = df_copy[i].fillna(False)
    df_copy[i] = df_copy[i].replace(i,True)
    df_copy[i] = df_copy[i].astype(bool)

#### Test

In [15]:
print(df_copy['puppo'].unique(), df_copy['pupper'].unique(), df_copy['floofer'].unique(),df_copy['pupper'].unique())


[False  True] [False  True] [False  True] [False  True]


### Issue #6:

#### Define
The *lang* column should be of datatype category.

#### Code

In [16]:
df_copy['lang'] = df_copy.lang.astype('category');

#### Test

In [17]:
df_copy.lang.dtype

CategoricalDtype(categories=['en', 'et', 'eu', 'in', 'nl', 'ro'], ordered=False)

### Issue #7

#### Define

Some missing values are represented as None instead of NaN.

#### Coding

In [18]:
df_copy.fillna(value=np.nan,inplace=True)

#### Test

In [19]:
# The geo column only contains nan value.
df_copy.geo.unique()

array([ nan])

### Issue #8
#### Define
The *display_text_range* holds record of the length of each text. This is better displayed as a length of the text.
#### Coding

In [20]:
df_copy['text_length'] = df_copy['display_text_range'].apply(lambda x: x[1]+1)

#### Test

In [21]:
# A new column with the length of each text does exist.
df_copy['text_length'].dtype
df_copy.text_length.sample(5)

893      55
69      138
1140     66
627      90
1408     68
Name: text_length, dtype: int64

### Issue #9
#### Define
Further proper datatype conversions needed.
#### Coding

In [22]:

df_copy['img_num'] = df_copy['img_num'].astype('int64')
df_copy['p1_conf'] = df_copy['p1_conf'].astype('float64')
df_copy['p2_conf'] = df_copy['p2_conf'].astype('float64')
df_copy['p3_conf'] = df_copy['p3_conf'].astype('float64')
df_copy['p1_dog'] = df_copy['p1_dog'].astype('bool')
df_copy['p2_dog'] = df_copy['p2_dog'].astype('bool')
df_copy['p3_dog'] = df_copy['p3_dog'].astype('bool')

#### Testing

In [23]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2073 entries, 0 to 2072
Data columns (total 53 columns):
tweet_id                      2073 non-null int64
in_reply_to_status_id_x       23 non-null float64
in_reply_to_user_id_x         23 non-null float64
timestamp                     2073 non-null datetime64[ns]
source_x                      2073 non-null category
text                          2073 non-null object
retweeted_status_id           79 non-null float64
retweeted_status_user_id      79 non-null float64
retweeted_status_timestamp    2073 non-null datetime64[ns]
expanded_urls                 2073 non-null object
rating_numerator              2073 non-null int64
rating_denominator            2073 non-null int64
name                          1457 non-null object
doggo                         2073 non-null bool
floofer                       2073 non-null bool
pupper                        2073 non-null bool
puppo                         2073 non-null bool
created_at             

## Tidiness issues

### Issue #1
#### Define
Remove columns like id_str contains repeated or redundant information.
#### Coding

In [24]:
#df_copy.info()
#df_copy.source_x.equals(df_copy.source_y)

# Drop all columns with repeated information.
# Drop all columns which ar string representations of integers.
# Drop all columns with zero non-null values.
df_copy.text.equals(df_copy.full_text)
df_copy.drop(columns=['source_y', 
                      'in_reply_to_status_id_y',
                      'in_reply_to_user_id_y',
                     'id',
                     'id_str',
                     'in_reply_to_status_id_str',
                     'in_reply_to_user_id_str',
                     'user','geo','contributors',
                     'coordinates','created_at','display_text_range',
                     'full_text'], axis=1,inplace=True)


#### Test

In [25]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2073 entries, 0 to 2072
Data columns (total 39 columns):
tweet_id                      2073 non-null int64
in_reply_to_status_id_x       23 non-null float64
in_reply_to_user_id_x         23 non-null float64
timestamp                     2073 non-null datetime64[ns]
source_x                      2073 non-null category
text                          2073 non-null object
retweeted_status_id           79 non-null float64
retweeted_status_user_id      79 non-null float64
retweeted_status_timestamp    2073 non-null datetime64[ns]
expanded_urls                 2073 non-null object
rating_numerator              2073 non-null int64
rating_denominator            2073 non-null int64
name                          1457 non-null object
doggo                         2073 non-null bool
floofer                       2073 non-null bool
pupper                        2073 non-null bool
puppo                         2073 non-null bool
truncated              

### Issue #2
#### Define
The doggo, puppy, floofer and pupper should be in a single stage column with the column names as values.
#### Coding

In [26]:
# Feel free to suggest shorter solution
def convert_to_stage(l):
    names = ['doggo','floofer','pupper','puppo']
    return_list = []
    for i in range(len(l)):
        if l[i]: 
            return_list.append(names[i])
    return '-'.join(return_list)

df_copy['stage'] = df_copy[['doggo','floofer','pupper','puppo']].apply(lambda x: convert_to_stage(list(x.values)),axis=1)
df_copy['stage'].replace('',np.nan,inplace=True)
df_copy['stage'] = df_copy['stage'].astype('category')
df_copy = df_copy.drop(columns=['doggo','floofer','pupper','puppo'])

In [27]:
df_copy['stage'].unique()

[NaN, doggo, puppo, pupper, floofer, doggo-puppo, doggo-floofer, doggo-pupper]
Categories (7, object): [doggo, puppo, pupper, floofer, doggo-puppo, doggo-floofer, doggo-pupper]

### Issue #3
#### Define
Remove columns which contain dictionaries. Each dictonary holds multible values.
#### Coding

In [28]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2073 entries, 0 to 2072
Data columns (total 36 columns):
tweet_id                      2073 non-null int64
in_reply_to_status_id_x       23 non-null float64
in_reply_to_user_id_x         23 non-null float64
timestamp                     2073 non-null datetime64[ns]
source_x                      2073 non-null category
text                          2073 non-null object
retweeted_status_id           79 non-null float64
retweeted_status_user_id      79 non-null float64
retweeted_status_timestamp    2073 non-null datetime64[ns]
expanded_urls                 2073 non-null object
rating_numerator              2073 non-null int64
rating_denominator            2073 non-null int64
name                          1457 non-null object
truncated                     2073 non-null bool
entities                      2073 non-null object
in_reply_to_screen_name       23 non-null object
place                         1 non-null object
is_quote_status       

In [29]:
df_copy.drop(columns=['place','entities'],axis=1,inplace=True)
df_copy.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2073 entries, 0 to 2072
Data columns (total 34 columns):
tweet_id                      2073 non-null int64
in_reply_to_status_id_x       23 non-null float64
in_reply_to_user_id_x         23 non-null float64
timestamp                     2073 non-null datetime64[ns]
source_x                      2073 non-null category
text                          2073 non-null object
retweeted_status_id           79 non-null float64
retweeted_status_user_id      79 non-null float64
retweeted_status_timestamp    2073 non-null datetime64[ns]
expanded_urls                 2073 non-null object
rating_numerator              2073 non-null int64
rating_denominator            2073 non-null int64
name                          1457 non-null object
truncated                     2073 non-null bool
in_reply_to_screen_name       23 non-null object
is_quote_status               2073 non-null bool
retweet_count                 2073 non-null int64
favorite_count        

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [30]:
df_copy.to_csv('twitter_archive_master.csv')

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

In [31]:
df = pd.read_csv('twitter_archive_master.csv')
df = df.drop(columns=['Unnamed: 0'],axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2073 entries, 0 to 2072
Data columns (total 34 columns):
tweet_id                      2073 non-null int64
in_reply_to_status_id_x       23 non-null float64
in_reply_to_user_id_x         23 non-null float64
timestamp                     2073 non-null object
source_x                      2073 non-null object
text                          2073 non-null object
retweeted_status_id           79 non-null float64
retweeted_status_user_id      79 non-null float64
retweeted_status_timestamp    2073 non-null object
expanded_urls                 2073 non-null object
rating_numerator              2073 non-null int64
rating_denominator            2073 non-null int64
name                          1457 non-null object
truncated                     2073 non-null bool
in_reply_to_screen_name       23 non-null object
is_quote_status               2073 non-null bool
retweet_count                 2073 non-null int64
favorite_count                2073 non-n

In [32]:
df.groupby(['source_x']).tweet_id.count()

source_x
TweetDeck               11
Twitter Web Client      30
Twitter for iphone    2032
Name: tweet_id, dtype: int64

In [33]:
df.groupby(['lang']).tweet_id.count()

lang
en    2065
et       1
eu       1
in       2
nl       3
ro       1
Name: tweet_id, dtype: int64

In [34]:
df['retweet_count'].describe()

count     2073.000000
mean      2976.089243
std       5054.897526
min         16.000000
25%        634.000000
50%       1408.000000
75%       3443.000000
max      79515.000000
Name: retweet_count, dtype: float64

In [35]:
df['retweet_count'][df['retweet_count'] > 40000]

59     45849
209    56625
329    48265
355    42228
358    42228
432    56625
851    79515
886    52360
Name: retweet_count, dtype: int64

In [36]:
df['retweet_count'][df['retweet_count'] <= 2976].count()/2073

0.7134587554269175

In [37]:
df['favorite_count'].describe()

count      2073.000000
mean       8556.718283
std       12098.640994
min           0.000000
25%        1674.000000
50%        3864.000000
75%       10937.000000
max      132810.000000
Name: favorite_count, dtype: float64

In [38]:
df['favorite_count'][df['favorite_count'] <= 8556.718283].count()/2073

0.69126869271587077

### Insights:

1. Almost all tweets are in english on an Twitter iPhone.

2. The retweet_count has a right skew. The mean is more the double the median. This is because realitvely few tweets are retweeted very often.

3. The favorite_count has even more right skewed than the retweet count.

### Visualization

In [39]:
df.plot.scatter(x='favorite_count',
                y='retweet_count',
                c='Blue', 
                title='Relationship between favorite_count and retweet_count',
                figsize=(10,4))

<matplotlib.axes._subplots.AxesSubplot at 0x7fc71479fe48>

#### Remarks
Some tweets have high retweet count but low favorite count (far left). Apart from the far left tweet that are often retweeted are often favorited.

In [40]:
# Pearson Mode Skewness
3*(df['favorite_count'].mean()-df['favorite_count'].median())/df['favorite_count'].std()


1.1636145626238581

In [41]:
# Pearson Mode Skewness

3*(df['retweet_count'].mean()-df['retweet_count'].median())/df['retweet_count'].std()



0.93063562697034541

In [42]:
#far_left
far_left = df_copy[(df_copy['favorite_count']==0) & (df_copy['retweet_count']>3443)]
frequency = far_left.groupby('p1').count().tweet_id/far_left.groupby('p1').count().tweet_id.sum()
frequency.sort_values()

p1
Afghan_hound                 0.017241
seat_belt                    0.017241
schipperke                   0.017241
prison                       0.017241
papillon                     0.017241
mousetrap                    0.017241
miniature_pinscher           0.017241
malamute                     0.017241
laptop                       0.017241
ice_bear                     0.017241
hippopotamus                 0.017241
gas_pump                     0.017241
dough                        0.017241
dishwasher                   0.017241
standard_poodle              0.017241
dalmatian                    0.017241
Lakeland_terrier             0.017241
Arabian_camel                0.017241
English_setter               0.017241
Irish_setter                 0.017241
Norwegian_elkhound           0.017241
Saint_Bernard                0.017241
upright                      0.017241
Staffordshire_bullterrier    0.017241
Tibetan_mastiff              0.017241
brown_bear                   0.017241
cash_mach

In [44]:
frequency = df_copy.groupby('p1').count().tweet_id/df_copy.groupby('p1').count().tweet_id.sum()
frequency.sort_values()

p1
grey_fox                     0.000482
hare                         0.000482
harp                         0.000482
hay                          0.000482
hotdog                       0.000482
hummingbird                  0.000482
ibex                         0.000482
ice_lolly                    0.000482
jersey                       0.000482
killer_whale                 0.000482
king_penguin                 0.000482
lacewing                     0.000482
lawn_mower                   0.000482
leaf_beetle                  0.000482
leopard                      0.000482
handkerchief                 0.000482
limousine                    0.000482
long-horned_beetle           0.000482
lorikeet                     0.000482
loupe                        0.000482
lynx                         0.000482
mailbox                      0.000482
maillot                      0.000482
marmot                       0.000482
maze                         0.000482
microphone                   0.000482
microwave

In [45]:
frequency = far_left.groupby('text_length').count().tweet_id/far_left.groupby('text_length').count().tweet_id.sum()
frequency.sort_values()

text_length
69     0.017241
136    0.017241
132    0.017241
130    0.017241
128    0.017241
127    0.017241
126    0.017241
125    0.017241
121    0.017241
118    0.017241
116    0.017241
110    0.017241
109    0.017241
117    0.017241
90     0.017241
101    0.017241
75     0.017241
98     0.017241
97     0.017241
79     0.017241
82     0.017241
85     0.017241
92     0.017241
88     0.017241
138    0.034483
134    0.034483
131    0.034483
108    0.034483
123    0.034483
96     0.034483
120    0.034483
119    0.034483
103    0.034483
107    0.034483
141    0.034483
140    0.051724
135    0.051724
113    0.051724
137    0.051724
Name: tweet_id, dtype: float64

In [46]:
frequency = df_copy.groupby('text_length').count().tweet_id/df_copy.groupby('text_length').count().tweet_id.sum()
frequency.sort_values()

text_length
14     0.000482
146    0.000482
143    0.000482
49     0.000482
44     0.000482
149    0.000482
27     0.000482
33     0.000482
37     0.000482
36     0.000482
51     0.000965
50     0.000965
148    0.000965
46     0.000965
45     0.000965
48     0.001447
142    0.001447
41     0.001447
39     0.001447
59     0.001447
47     0.001930
57     0.001930
40     0.001930
53     0.001930
42     0.001930
52     0.002412
70     0.002412
58     0.002412
66     0.002894
56     0.002894
         ...   
100    0.009648
110    0.009648
121    0.010130
105    0.010130
118    0.010613
119    0.010613
112    0.010613
94     0.010613
132    0.011095
103    0.011095
128    0.011577
107    0.011577
111    0.011577
109    0.012060
108    0.012542
115    0.013507
99     0.013507
131    0.013507
134    0.014954
114    0.016401
113    0.017366
136    0.017849
135    0.019778
137    0.023155
116    0.023637
117    0.027979
138    0.037627
139    0.038591
140    0.068017
141    0.085866
Name: tweet_

In [64]:
frequency = far_left.groupby('is_quote_status').count().tweet_id/far_left.groupby('is_quote_status').count().tweet_id.sum()
frequency.sort_values()


is_quote_status
False    1.0
Name: tweet_id, dtype: float64

In [50]:
df_copy.columns

Index(['tweet_id', 'in_reply_to_status_id_x', 'in_reply_to_user_id_x',
       'timestamp', 'source_x', 'text', 'retweeted_status_id',
       'retweeted_status_user_id', 'retweeted_status_timestamp',
       'expanded_urls', 'rating_numerator', 'rating_denominator', 'name',
       'truncated', 'in_reply_to_screen_name', 'is_quote_status',
       'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'lang',
       'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf',
       'p2_dog', 'p3', 'p3_conf', 'p3_dog', 'text_length', 'stage'],
      dtype='object')

In [63]:
frequency = df_copy.groupby('is_quote_status').count().tweet_id/df_copy.groupby('is_quote_status').count().tweet_id.sum()
frequency.sort_values()


is_quote_status
False    1.0
Name: tweet_id, dtype: float64

In [68]:
far_left.text_length.describe()

count     58.000000
mean     116.655172
std       19.085727
min       69.000000
25%      104.000000
50%      119.500000
75%      134.000000
max      141.000000
Name: text_length, dtype: float64

In [69]:
df_copy.text_length.describe()


count    2073.000000
mean      112.586589
std        26.261775
min        14.000000
25%        95.000000
50%       117.000000
75%       138.000000
max       149.000000
Name: text_length, dtype: float64

In [71]:
without_far_left = far_left = df_copy[(df_copy['favorite_count']>0) | (df_copy['retweet_count']<=3443)]
without_far_left[['favorite_count','retweet_count']].corr()

Unnamed: 0,favorite_count,retweet_count
favorite_count,1.0,0.912328
retweet_count,0.912328,1.0


In [None]:
df_copy[['favorite_count','retweet_count']].corr()