# Wrangling & Analysis of The WeRateDogs Twitter Archive

### By Chukwuma Festus

## Table of Contents
* Introduction
* Data Wrangling
    * Gathering Data
    * Assessing Data
    * Cleaning Data
* Analyzing and Visualizing Data
* Conclusion

## Introduction

Real-world data rarely comes clean. In this project, we will use Python and its libraries to gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. We will document our wrangling efforts in this Jupyter Notebook, plus showcase them through analyses and visualizations using Python (and its libraries).

The dataset that we will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

We will be working with three different datasets in this project:

* Enhanced Twitter Archive: The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. And this data is not clean either.

* Image Predictions File: This is a neural network file that classifies breeds of dogs.It will be downloaded programmatically from a server using the Requests library.

* Additional Data via the Twitter API: We will query Twitter's API to gather valuable data such as retweet count and favorite count which are missing from the Enhanced Twitter archive



After gathering, assessing and cleaning these data, we will perform exploratory data analysis where we will provide answers to questions such as:

* What are the top 5 tweets by favorite count?
* What are the top 5 tweets by retweet count?
* Are retweets and replies correlated?
* Which are the most common dog stage?
* What dog stage receives more likes and retweets?
* What dog stage receives the best ratings?
* What's the distribution for the retweet count?


## Data Wrangling

### Data Gathering

In [1]:
#Import relevant libraries

import pandas as pd
import numpy as np
import os
import tweepy
import requests
import matplotlib.pyplot as plt
import json
import time
import datetime
import altair as alt
import random
from bs4 import BeautifulSoup as bs
from matplotlib import cm
from PIL import Image
from io import BytesIO
%matplotlib inline


Enhanced Twitter Archive data

In [2]:
# load the enhanced twitter archive into a dataframe

df_archive = pd.read_csv('twitter-archive-enhanced.csv')
df_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


Image predictions data

In [3]:
# download the image predictions file programmatically from the server

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
    
response = requests.get(url)
with open(os.path.join(url.split('/')[-1]), mode='wb') as file:
        file.write(response.content)
        
df_predictions = pd.read_csv('image-predictions.tsv', sep='\t') # we use \t because it's a tsv file
df_predictions.head()


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True



Additional data will be downloaded by querying Twitter's API for the JSON data of each tweet using the tweet IDs in the archive. This will be done using Tweepy. The resulting JSON data for each tweet will be stored in file named tweet_json.txt file in a line format and read line by line into a data frame.






In [4]:
# Lets add our keys to query Twitter API (Hide before sharing)

consumer_key = 'zaBZuyfNQvVKvVR0kjgEaghkx'
consumer_secret = 'a4tMS3KZSPtzv9wMYxydkrgcdzmecDOs9y5VujB8HroOc8g1ma'
access_token = '737102246-I5b0QdL2IdBmATKoIyw0ztVSp0JeqbOXqULQB7wZ'
access_secret = 'P3nIkyxLZFglY2K18C3mrnpCYJn8HEQ9RGDbj0uADF25d'


In [5]:
# Create the Twitter API object and set rate limit parameters

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit = True)

In [6]:
# Get a list of tweet_ids from the enhanced twitter archive to use for 
# downloading with the Twitter API

tweet_ids = df_archive.tweet_id.values
print("# of tweet_ids: " + str(len(tweet_ids)) + "\n")

count = 0
tweet_json_list = []
fails_dict = {}

start_time = time.time()

for tweet_id in tweet_ids:
    count += 1
    try:
        # attempt to get the tweet's JSON data and append to the tweet JSON list
        tweet = api.get_status(tweet_id, tweet_mode = 'extended')
        tweet_json_list.append(tweet._json)
    except tweepy.errors.TweepyException as err:
        # save the error to the fail dictionary for review
        print("TweepError for id:  " + str(tweet_id))
        fails_dict[tweet_id] = err
        pass
    # To save space, only print out loop/tweet id for every 100th tweet
    if count % 100 == 0:
        print("loop # " + str(count))

elapsed_time = time.time() - start_time

elapsed_time_str = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
print("\nTime elapsed (HH:MM:SS):  " + elapsed_time_str + "\n")

# display list of tweets with errors
print("Number of TweepErrors:  {}\n".format(len(fails_dict)))
for tweet_id in fails_dict:
    print(tweet_id, fails_dict[tweet_id])

# of tweet_ids: 2356

TweepError for id:  888202515573088257
TweepError for id:  873697596434513921
loop # 100
TweepError for id:  872668790621863937
TweepError for id:  872261713294495745
TweepError for id:  869988702071779329
TweepError for id:  866816280283807744
TweepError for id:  861769973181624320
TweepError for id:  856602993587888130
TweepError for id:  856330835276025856
loop # 200
TweepError for id:  851953902622658560
TweepError for id:  851861385021730816
TweepError for id:  845459076796616705
TweepError for id:  844704788403113984
TweepError for id:  842892208864923648
TweepError for id:  837366284874571778
TweepError for id:  837012587749474308
loop # 300
TweepError for id:  829374341691346946
TweepError for id:  827228250799742977
loop # 400
loop # 500
TweepError for id:  812747805718642688
TweepError for id:  802247111496568832
loop # 600
loop # 700
TweepError for id:  779123168116150273
TweepError for id:  775096608509886464
loop # 800
TweepError for id:  771004394259

Rate limit reached. Sleeping for: 670


TweepError for id:  754011816964026368
loop # 1000
loop # 1100
loop # 1200
loop # 1300
loop # 1400
loop # 1500
loop # 1600
loop # 1700
TweepError for id:  680055455951884288


Rate limit reached. Sleeping for: 678


loop # 1800
loop # 1900
loop # 2000
loop # 2100
loop # 2200
loop # 2300

Time elapsed (HH:MM:SS):  00:32:23

Number of TweepErrors:  29

888202515573088257 404 Not Found
144 - No status found with that ID.
873697596434513921 404 Not Found
144 - No status found with that ID.
872668790621863937 404 Not Found
144 - No status found with that ID.
872261713294495745 404 Not Found
144 - No status found with that ID.
869988702071779329 404 Not Found
144 - No status found with that ID.
866816280283807744 404 Not Found
144 - No status found with that ID.
861769973181624320 404 Not Found
144 - No status found with that ID.
856602993587888130 404 Not Found
144 - No status found with that ID.
856330835276025856 404 Not Found
144 - No status found with that ID.
851953902622658560 404 Not Found
144 - No status found with that ID.
851861385021730816 404 Not Found
144 - No status found with that ID.
845459076796616705 404 Not Found
144 - No status found with that ID.
844704788403113984 404 Not Found
14

In [7]:
#Save JSON data to file
tweet_json_file = 'tweet_json.txt'

In [8]:
# save the JSON data in the list to the output file
with open(tweet_json_file, 'w') as outfile:
    for tweet_json in tweet_json_list:
        json.dump(tweet_json, outfile)
        outfile.write('\n')

Read in JSON data to DataFrame Extract the required fields from each tweet's JSON data and store in a separate file, tweet_data_extra.csv, for use during the assessment phase.

In [9]:
# read in the JSON data from the text file, and save to a DataFrame
tweet_json_data = []

with open(tweet_json_file, 'r') as json_file:
    # read the first line to start the loop
    line = json_file.readline()
    while line:
        tweet = json.loads(line)

        # extract variables from the JSON data
        tweet_id = tweet['id']
        retweet_count = tweet['retweet_count']
        favorite_count = tweet['favorite_count']
        
        # create a dictionary with the JSON data, then add to a list
        json_data = {'tweet_id': tweet_id, 
                     'retweet_count': retweet_count, 
                     'favorite_count': favorite_count
                    }
        tweet_json_data.append(json_data)

        # read the next line of JSON data
        line = json_file.readline()
        # ----- while -----

# convert the tweet JSON data dictionary list to a DataFrame
df_additional_data = pd.DataFrame(tweet_json_data, 
                                   columns = ['tweet_id',
                                              'retweet_count',
                                              'favorite_count'])

In [10]:
# Read the contents of the created file and create a dataframe with the fields of interest 

df = []
with open('tweet_json.txt') as f:
    for line in f:
        tweet = (json.loads(line))
        tweet_id = tweet['id']
        retweet_count = tweet['retweet_count']
        favorite_count = tweet['favorite_count']

        df.append({'retweet_count' : retweet_count,
                  'favorite_count' : favorite_count,
                  'tweet_id' : tweet_id})
        
more_data = pd.DataFrame(df, columns = ['tweet_id', 'retweet_count', 'favorite_count'])

In [11]:
df_additional_data.head(30)

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7009,33796
1,892177421306343426,5301,29314
2,891815181378084864,3481,22041
3,891689557279858688,7217,36898
4,891327558926688256,7760,35272
5,891087950875897856,2599,17793
6,890971913173991426,1663,10352
7,890729181411237888,15752,56808
8,890609185150312448,3620,24515
9,890240255349198849,6098,27938


In [12]:
df_additional_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2327 entries, 0 to 2326
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   tweet_id        2327 non-null   int64
 1   retweet_count   2327 non-null   int64
 2   favorite_count  2327 non-null   int64
dtypes: int64(3)
memory usage: 54.7 KB


In [13]:
df_additional_data.to_csv('additional_data.csv', index = False)

### Data Assessing

#### Enhanced Archive Dataframe
Visual Assessment

In [14]:
df_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [15]:
df_archive.tail()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,
2355,666020888022790149,,,2015-11-15 22:32:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a Japanese Irish Setter. Lost eye...,,,,https://twitter.com/dog_rates/status/666020888...,8,10,,,,,


Programmatic Assessment

In [16]:
df_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

Observations:

Quality issues
* There are missing values in the in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls columns
* The timestamp column is in string (object) format instead of datetime
* The source column has irrelevant HTML code
* Name column has inconsistent/invalid data such as a, None
* Some rows have several identical values in the expanded_url column concatenated by a comma
* Tweet_id fields in the dataset are stored as numeric values and should be strings
* The values for the column "floffer" should not be capitalized to maintain integrity with the others

Tidiness
* There are four columns for dog stages instead of one
* The dog's rating appears both in the "text" column and the rating numerator/denominator columns
* The link to the dog's photo is part of the status text and should have own column

In [17]:
#Let us investigate the name columns further
df_archive.name.value_counts().head(20)

None       745
a           55
Charlie     12
Cooper      11
Lucy        11
Oliver      11
Tucker      10
Penny       10
Lola        10
Winston      9
Bo           9
Sadie        8
the          8
Daisy        7
Buddy        7
Toby         7
an           7
Bailey       7
Leo          6
Oscar        6
Name: name, dtype: int64

We can see aside None, other invalid names like a, the, an begin with small letters. We'll see look for names that begin with small letters and confirm our hunch.

In [18]:
invalid_names = df_archive.name.str.contains('^[a-z]', regex = True)
df_archive[invalid_names].name.value_counts().sort_index()

a               55
actually         2
all              1
an               7
by               1
getting          2
his              1
incredibly       1
infuriating      1
just             4
life             1
light            1
mad              2
my               1
not              2
officially       1
old              1
one              4
quite            4
space            1
such             1
the              8
this             1
unacceptable     1
very             5
Name: name, dtype: int64

In [19]:
#confirming the total number of invalid names and missing
print(len(df_archive[invalid_names]) + len(df_archive[df_archive.name == 'None']))

854


The enhanced archive contains 854 names that are invalid/missing

In [20]:
df_archive.isnull()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,False,True,True,False,False,False,True,True,True,False,False,False,False,False,False,False,False
1,False,True,True,False,False,False,True,True,True,False,False,False,False,False,False,False,False
2,False,True,True,False,False,False,True,True,True,False,False,False,False,False,False,False,False
3,False,True,True,False,False,False,True,True,True,False,False,False,False,False,False,False,False
4,False,True,True,False,False,False,True,True,True,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,False,True,True,False,False,False,True,True,True,False,False,False,False,False,False,False,False
2352,False,True,True,False,False,False,True,True,True,False,False,False,False,False,False,False,False
2353,False,True,True,False,False,False,True,True,True,False,False,False,False,False,False,False,False
2354,False,True,True,False,False,False,True,True,True,False,False,False,False,False,False,False,False


In [21]:
#Le's check for duplicates 
sum(df_archive.duplicated())

0

In [22]:
#confirming that indeed there are duplicates using the tweet_ID which is unique
df_archive[df_archive.tweet_id.duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [23]:
# let's look at the expanded URLs column which we found to have some missing data
# A link in the tweet's text and no expanded URL signifies quality issues

check_url = df_archive[df_archive['expanded_urls'].isnull()]
check_url['text'].str.contains('http').sum()

0

No issue there. Every tweet with an url has an expanded URL. Those without one are usually retweets or replies

In [24]:
# value counts for the source column
df_archive.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [25]:
# every numerator not other than 10-14 in the dogs rating is a possible quality issue
df_archive.rating_numerator.value_counts().sort_index(ascending=False)

1776      1
960       1
666       1
420       2
204       1
182       1
165       1
144       1
143       1
121       1
99        1
88        1
84        1
80        1
75        2
60        1
50        1
45        1
44        1
27        1
26        1
24        1
20        1
17        1
15        2
14       54
13      351
12      558
11      464
10      461
9       158
8       102
7        55
6        32
5        37
4        17
3        19
2         9
1         9
0         2
Name: rating_numerator, dtype: int64

In [26]:
# every denominator not equal to 10 in the dogs rating is a possible quality issue
df_archive.rating_denominator.value_counts().sort_index(ascending=False)

170       1
150       1
130       1
120       1
110       1
90        1
80        2
70        1
50        3
40        1
20        2
16        1
15        1
11        3
10     2333
7         1
2         1
0         1
Name: rating_denominator, dtype: int64

In [27]:
# we check for duplicated IDs in the feeds dataframe
df_archive[df_archive.tweet_id.duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


Using the tweet_ID we confirm that there are no duplicates

#### Predictions Dataset

In [28]:
df_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [29]:
df_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [30]:
df_predictions.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


The predictions dataframe is the only one with numerical data. With .describe() we can check whether the ranges, means and quartiles make sense. In this test, we correctly see that the probability for the first guess has the largest value. We also see that the highest probability is equal to 1, which is expected.

Observations:

Quality issues:

* There are no real quality issues here

Tidiness issues: 

* Predictions, confidence intervals and dog tests are all spread in three different columns

#### Additional Data dataset

In [31]:
df_additional_data.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7009,33796
1,892177421306343426,5301,29314
2,891815181378084864,3481,22041
3,891689557279858688,7217,36898
4,891327558926688256,7760,35272


In [32]:
df_additional_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2327 entries, 0 to 2326
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   tweet_id        2327 non-null   int64
 1   retweet_count   2327 non-null   int64
 2   favorite_count  2327 non-null   int64
dtypes: int64(3)
memory usage: 54.7 KB


In [33]:
# Checking for duplicated columns by creating a list of every column in every dataset
all_columns = pd.Series(list(df_archive) + list(df_predictions) + list(df_additional_data))
# all_columns
all_columns[all_columns.duplicated()]

17    tweet_id
29    tweet_id
dtype: object

The tweet_id column is the only column that is duplicated among the three data sets which is fine.

Observations:

Quality Issues:

* The retweet and favorite count columns have float instead of int as datatype

Tidiness Issues:

* The dataset is split from the main dataset describing tweets

## Data Cleaning

To begin cleaning of the identified issues, we first start by creating copies of the original data frames

In [163]:
# create copies of
df_archive_clean = df_archive.copy()
df_predictions_clean = df_predictions.copy()
df_additional_data_clean = df_additional_data.copy()

#### Some of the tweets are retweets and replies

Define

Remove all observations in the df_archive_clean dataset that have values in the rows in_reply_to_status_id or retweeted_status_id. Then remove those two columns plus retweeted_status_user_id, retweeted_status_timestamp and in_reply_to_user_id



Code

In [164]:
df_archive_clean = df_archive_clean[(df_archive_clean['in_reply_to_status_id'].isna() == True)  & (df_archive_clean['retweeted_status_id'].isna() == True)]

df_archive_clean = df_archive_clean.drop(['in_reply_to_status_id',
                              'in_reply_to_user_id',
                              'retweeted_status_id',
                              'retweeted_status_user_id',
                              'retweeted_status_timestamp'],
                            axis = 1)

Test

In [165]:
df_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2097 non-null   int64 
 1   timestamp           2097 non-null   object
 2   source              2097 non-null   object
 3   text                2097 non-null   object
 4   expanded_urls       2094 non-null   object
 5   rating_numerator    2097 non-null   int64 
 6   rating_denominator  2097 non-null   int64 
 7   name                2097 non-null   object
 8   doggo               2097 non-null   object
 9   floofer             2097 non-null   object
 10  pupper              2097 non-null   object
 11  puppo               2097 non-null   object
dtypes: int64(3), object(9)
memory usage: 213.0+ KB


#### Incorrect rating numerators. Numerators are a 10 but in some cases we have 13.5

Define

Change datatype and use regex to extract the ratings numerators again.

Code

In [166]:
df_archive_clean['rating_numerator'] = df_archive_clean['text'].str.extract('(\d+\.*\d*\/\d+)', expand=False).str.split('/').str[0]

Test

In [167]:
df_archive_clean.loc[df_archive_clean['tweet_id'] == 883482846933004288, 'rating_numerator']

45    13.5
Name: rating_numerator, dtype: object

In [168]:
print(df_archive_clean.rating_numerator.dtype)

object


#### Data about dogs and tweets contained in same dataset

Define

In order to have observational units in their own table, we need to split the df_archive_clean data-set into a tweets table and a dogs table. This is a design decision because there are many several tweets that include information and ratings for more than one dog. Ratings and dog stages refer to dogs, whereas retweets an likes are related to the tweets.

In [169]:
dogs = df_archive_clean[['tweet_id', 'name', 'doggo', 'floofer', 'pupper', 'puppo', 'rating_numerator', 'rating_denominator']].copy()
tweets = df_archive_clean.drop(['name', 'doggo', 'floofer', 'pupper', 'puppo', 'rating_numerator', 'rating_denominator'], axis=1)

Test

In [170]:
print(tweets.columns)
print(dogs.columns)

Index(['tweet_id', 'timestamp', 'source', 'text', 'expanded_urls'], dtype='object')
Index(['tweet_id', 'name', 'doggo', 'floofer', 'pupper', 'puppo',
       'rating_numerator', 'rating_denominator'],
      dtype='object')


#### Dog stage is in four different column

Define

We have values as variables in the now called dogs table. We will melt the columns doggo, floofer, pupper, puppo into a new variable called dog_stage. Before melting, create a new column to identify those dogs with dog stage 'unknown'

Code

In [171]:
# we want to create first an extra column to tag those dogs without a recognized stage. 
# This will help cleaning the dataset afterwards
def unknown(row):
    if row ['doggo'] == 'None' and row ['floofer'] == 'None' and row ['pupper'] == 'None' and row ['puppo'] == 'None':
        val = 'unknown'
    else:
        val ='None'
    return val

dogs['unknown'] = dogs.apply(unknown, axis=1)

In [172]:
dogs.sample()

Unnamed: 0,tweet_id,name,doggo,floofer,pupper,puppo,rating_numerator,rating_denominator,unknown
122,869227993411051520,Gizmo,,,,,13,10,unknown


In [173]:
# we melt the different dog stages into a new column calle 'dog_stage'
dogs = pd.melt(dogs, id_vars =['tweet_id', 'name', 'rating_numerator','rating_denominator'],
                     value_vars = ['doggo', 'floofer', 'pupper', 'puppo', 'unknown'],
                     var_name = 'dog_stage', 
                    value_name = 'value')

In [174]:
# clean the duplicated rows created in the preious process and drop the 'value' variable
dogs = dogs[dogs['value']!= 'None']
dogs = dogs.drop('value', axis=1)

Test

In [175]:
dogs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2108 entries, 9 to 10484
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2108 non-null   int64 
 1   name                2108 non-null   object
 2   rating_numerator    2108 non-null   object
 3   rating_denominator  2108 non-null   int64 
 4   dog_stage           2108 non-null   object
dtypes: int64(2), object(3)
memory usage: 98.8+ KB


It is not unusual that now we have more records (2,108 vs 2,097), because some tweets referred to more than one dog and included more than one dog stage

#### Predictions are in three different columns

Define

The different predictions, the confidence and the test whether the prediction is a dog or not need to be melted into four columns: prediction_number, prediction, confidence and dog

Code

In [176]:
# This script iterates the melting process for the three categories of data: prediction, confidence and dog test. 
# at the end we have a clean tidy dataset.

# create 'prediction_number' and 'prediction' columns
df_predictions_clean = pd.melt(df_predictions_clean, id_vars = ['tweet_id', 'jpg_url', 'img_num', 'p1_conf', 'p1_dog',
       'p2_conf', 'p2_dog', 'p3_conf', 'p3_dog'],
               var_name = 'prediction_number',
               value_name = 'prediction')

# create 'confidence' column
df_predictions_clean = pd.melt(df_predictions_clean, id_vars = ['tweet_id', 'jpg_url', 'img_num', 'p1_dog',
       'p2_dog', 'p3_dog', 'prediction_number', 'prediction'],
               var_name = 'to_delete',
               value_name = 'confidence')

# remove newly created duplicated rows
df_predictions_clean = df_predictions_clean[df_predictions_clean['prediction_number'] == df_predictions_clean['to_delete'].str[:2]]
# remove unnecesary column
df_predictions_clean = df_predictions_clean.drop('to_delete', axis=1)

# create 'dog' column
df_predictions_clean = pd.melt(df_predictions_clean, id_vars = ['tweet_id', 'jpg_url', 'img_num','prediction_number', 'prediction', 'confidence'],
               var_name = 'to_delete2',
               value_name = 'dog')

# remove newly created duplicated rows
df_predictions_clean = df_predictions_clean[df_predictions_clean['prediction_number'] == df_predictions_clean['to_delete2'].str[:2]]

# remove unnecesary column
df_predictions_clean = df_predictions_clean.drop('to_delete2', axis=1)
# remove 'p' from prediction number
df_predictions_clean['prediction_number'] = df_predictions_clean['prediction_number'].str[1]


Test

In [177]:
df_predictions_clean.sort_values(by = ['tweet_id', 'prediction_number']).head(10)

Unnamed: 0,tweet_id,jpg_url,img_num,prediction_number,prediction,confidence,dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,1,Welsh_springer_spaniel,0.465074,True
8300,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,2,collie,0.156665,True
16600,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,3,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,1,redbone,0.506826,True
8301,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,2,miniature_pinscher,0.074192,True
16601,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,3,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,1,German_shepherd,0.596461,True
8302,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,2,malinois,0.138584,True
16602,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,3,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,1,Rhodesian_ridgeback,0.408143,True


In [178]:
df_predictions_clean.shape

(6225, 7)

#### The additional_data dataset is split from the main dataset describing tweets

Define

Merge df_additional_data_clean into tweets

Code

In [179]:
# merge both dataframes on 'tweet_id'. Left join to preserve all tweets, independent of whether retweet info
# could be retrieved through twitter's API

tweets = tweets.merge(df_additional_data_clean, how = 'left', on = 'tweet_id')

Test

In [180]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2096
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tweet_id        2097 non-null   int64  
 1   timestamp       2097 non-null   object 
 2   source          2097 non-null   object 
 3   text            2097 non-null   object 
 4   expanded_urls   2094 non-null   object 
 5   retweet_count   2089 non-null   float64
 6   favorite_count  2089 non-null   float64
dtypes: float64(2), int64(1), object(4)
memory usage: 131.1+ KB


#### Irrelevant HTML code in source column

Define

Remove the HTML from the source column in the feed dataframe, leaving only the clear name of the source. Use Beautiful Soup because the text is the contents of an HTML tag.

In [181]:
# Iterate through each row and extract the source's text with beautiful soup

new_source = []
for line, row in tweets.iterrows():
    soup = bs(row.source)
    x = soup.find('a').contents[0]
    new_source.append(x)
    
tweets['source'] = new_source

Test

In [182]:
pd.Series(new_source).value_counts()

Twitter for iPhone     1964
Vine - Make a Scene      91
Twitter Web Client       31
TweetDeck                11
dtype: int64

#### Time column is not in datetime format

Define

Change timestamp datatype in tweets dataframe from object to datatime type

Code

In [183]:
tweets['timestamp'] = pd.to_datetime(tweets['timestamp'])

Test

In [184]:
tweets.dtypes

tweet_id                        int64
timestamp         datetime64[ns, UTC]
source                         object
text                           object
expanded_urls                  object
retweet_count                 float64
favorite_count                float64
dtype: object

#### The rating for dog in some tweet IDs are wrong

Define

Look up for rating in tweet's text and update rating in dogs table

Code

In [185]:
dogs.loc[dogs['tweet_id'] == 716439118184652801, 'rating_numerator'] = 11
dogs.loc[dogs['tweet_id'] == 716439118184652801, 'rating_denominator'] = 10

dogs.loc[dogs['tweet_id'] == 722974582966214656, 'rating_numerator'] = 13
dogs.loc[dogs['tweet_id'] == 722974582966214656, 'rating_denominator'] = 10

dogs.loc[dogs['tweet_id'] == 682962037429899265, 'rating_numerator'] = 10
dogs.loc[dogs['tweet_id'] == 682962037429899265, 'rating_denominator'] = 10

Test

In [186]:
ids = [716439118184652801, 722974582966214656, 682962037429899265]
dogs.loc[dogs['tweet_id'].isin(ids), ['tweet_id','rating_numerator', 'rating_denominator']]

Unnamed: 0,tweet_id,rating_numerator,rating_denominator
9336,722974582966214656,13,10
9373,716439118184652801,11,10
9814,682962037429899265,10,10


#### Many dog names are incorrect, including those with id 740373189193256964 and 770414278348247044

Define

Look up and correct the dog's names in the dogs table

Code

In [187]:
dogs.loc[dogs['tweet_id'] == 887517139158093824, 'name'] = 'None'
dogs.loc[dogs['tweet_id'] == 770414278348247044, 'name'] = 'Al Cabone'

Test

In [188]:
ids = [887517139158093824, 770414278348247044]
dogs.loc[dogs['tweet_id'].isin(ids), ['tweet_id','name']]

Unnamed: 0,tweet_id,name
8409,887517139158093824,
9020,770414278348247044,Al Cabone


#### The dog stage for dog in tweet with id 854010172552949760 is incorrect

Define

Search in dog table and correct

Code

In [189]:
dogs.loc[dogs['tweet_id'] == 854010172552949760, 'dog_stage'] = 'floofer'
dogs.drop(2258, inplace = True)

Test

In [190]:
dogs.loc[dogs['tweet_id'] == 854010172552949760 , ['tweet_id','dog_stage']]

Unnamed: 0,tweet_id,dog_stage
161,854010172552949760,floofer


#### Separate URL in expanded URL column

Define

Split value to keep only the first URL in the column expanded_urls in the tweets dataset. Rename column to expanded_url

Code

In [191]:
tweets['expanded_urls'] = tweets['expanded_urls'].str.split(',', expand=True)[0]

tweets = tweets.rename(index=str, columns={"expanded_urls": "expanded_url"})

Test

In [192]:
tweets['expanded_url'].str.contains(',').sum()

0

#### Tweet IDs are in integer data type

Define

Change datatypes for the tweet_id variable from int to object in the three datasets

Code

In [193]:
tweets['tweet_id'] = tweets['tweet_id'].astype(str)
dogs['tweet_id'] = dogs['tweet_id'].astype(str)
df_predictions_clean['tweet_id'] = df_predictions_clean['tweet_id'].astype(str)

Test

In [194]:
print(tweets.tweet_id.dtype)
print(dogs.tweet_id.dtype)
print(df_predictions_clean.tweet_id.dtype)

object
object
object


#### Retweet and count are in float format

Define

Change the datatypes for the retweet_count and favorite_count variables from float to int in tweets dataset

Code

In [195]:
# I need to start by changing NaNs into 0, for the dtype conversion to int to work
tweets['retweet_count'] = tweets['retweet_count'].fillna(0).astype(int)
tweets['favorite_count'] = tweets['favorite_count'].fillna(0).astype(int)
tweets['retweet_count'] = tweets['retweet_count'].astype(int)
tweets['favorite_count'] = tweets['favorite_count'].astype(int)

Test

In [196]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2097 entries, 0 to 2096
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype              
---  ------          --------------  -----              
 0   tweet_id        2097 non-null   object             
 1   timestamp       2097 non-null   datetime64[ns, UTC]
 2   source          2097 non-null   object             
 3   text            2097 non-null   object             
 4   expanded_url    2094 non-null   object             
 5   retweet_count   2097 non-null   int32              
 6   favorite_count  2097 non-null   int32              
dtypes: datetime64[ns, UTC](1), int32(2), object(4)
memory usage: 114.7+ KB


#### The name Floffer in Dog stage begins with capital letter which is inconsistent with other names in the dog stage columns

Define

Apply string methods to un-capitalize Floffer in the dog_stage values of the dogs dataset

Code

In [197]:
dogs['dog_stage'] = dogs['dog_stage'].str.lower()

Test

In [198]:
dogs['dog_stage'].value_counts()

unknown    1761
pupper      230
doggo        82
puppo        24
floofer      10
Name: dog_stage, dtype: int64

#### Change all lowercase names to None for uniformity

Define

Identify these names and and change to None

Code

In [199]:
dogs.loc[(dogs.name.str.contains('^[a-z]', regex = True)), 'name'] = "None"

Test

In [201]:
dogs.name.value_counts()

None          713
Lucy           11
Charlie        11
Cooper         10
Oliver         10
             ... 
Davey           1
Fizz            1
Dixie           1
Al Cabone       1
Christoper      1
Name: name, Length: 930, dtype: int64


We will now save the cleaned datasets and proceed to analysis. The goal here is not to clean the entire data set as it is super dirty but to get clean enough data for our intended analysis.


In [202]:
# save the three tables with independent observational units
tweets.to_csv('tweets.csv', index=False)
dogs.to_csv('dogs.csv', index=False)
df_predictions_clean.to_csv('predictions.csv',index=False)

In [210]:
# create and save a master dataset. The key needs to be set to 'tweet_id' in order for the join to work.
twitter_archive_master = dogs.set_index('tweet_id').join(tweets.set_index('tweet_id'), on='tweet_id', how='left')
twitter_archive_master.to_csv('twitter_archive_master.csv', index=False)

In [211]:
twitter_archive_master.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2107 entries, 890240255349198849 to 666020888022790149
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   name                2107 non-null   object             
 1   rating_numerator    2107 non-null   object             
 2   rating_denominator  2107 non-null   int64              
 3   dog_stage           2107 non-null   object             
 4   timestamp           2107 non-null   datetime64[ns, UTC]
 5   source              2107 non-null   object             
 6   text                2107 non-null   object             
 7   expanded_url        2104 non-null   object             
 8   retweet_count       2107 non-null   int32              
 9   favorite_count      2107 non-null   int32              
dtypes: datetime64[ns, UTC](1), int32(2), int64(1), object(6)
memory usage: 164.6+ KB


## Analyzing & Visualizing the Data

Here we will answer questions posed in the introduction section

#### What are the top 5 tweets by favourite count?

In [212]:
df_additional_data_clean.sort_values(by = 'favorite_count', ascending = False).head(5)

Unnamed: 0,tweet_id,retweet_count,favorite_count
1011,744234799360020481,70735,144787
395,822872901745569793,39920,124012
515,807106840509214720,51677,111612
129,866450705531457537,30210,108840
1051,739238157791694849,52897,107168


#### What are the top 5 tweets by retweet counts?

In [213]:
df_additional_data_clean.sort_values(by = 'retweet_count', ascending = False).head(5)

Unnamed: 0,tweet_id,retweet_count,favorite_count
1011,744234799360020481,70735,144787
1051,739238157791694849,52897,107168
515,807106840509214720,51677,111612
395,822872901745569793,39920,124012
65,879415818425184262,37411,92803


#### Are retweets and likes correlated?

In [214]:
master_tweets = pd.read_csv('twitter_archive_master.csv')

In [215]:
chart = alt.Chart(tweets).mark_point()

In [216]:
alt.Chart(master_tweets[master_tweets['dog_stage']!= 'unknown']).mark_point(opacity = 0.5).encode(
    alt.X('retweet_count',scale=alt.Scale(type='log'), bin=True),
    alt.Y('favorite_count'), 
    color = 'dog_stage').properties(title = 'Retweets and Likes are strongly correlated')

#### Which is the most common dog stage?

In [217]:
alt.Chart(master_tweets[master_tweets['dog_stage']!= 'unknown']).mark_bar(color='blue', opacity=0.5).encode(
    alt.Y('dog_stage', axis=alt.Axis(title='Dog Stage')),
    x='count()'
).properties(title = 'Most dogs rated are puppers')

#### What dog stage receives most likes and retweets?

In [218]:
alt.Chart(master_tweets[master_tweets['dog_stage']!= 'unknown']).mark_bar(color='firebrick', opacity=0.5).encode(
    alt.Y('dog_stage', axis=alt.Axis(title='Dog Stage')),
    alt.X('average(favorite_count)', axis=alt.Axis(title='Average Like Count'))
).properties(title = 'People really like puppos')

In [219]:
alt.Chart(master_tweets[master_tweets['dog_stage']!= 'unknown']).mark_bar(color='firebrick', opacity=0.5).encode(
    alt.Y('dog_stage', axis=alt.Axis(title='Dog Stage')),
    alt.X('average(retweet_count)', axis=alt.Axis(title='Average Retweet Count'))
).properties(title = 'People really like puppos')

#### Which dog stage receives the best rating?

In [220]:
alt.Chart(master_tweets[master_tweets['dog_stage']!= 'unknown']).mark_bar(color='green', opacity=0.5).encode(
    alt.Y('dog_stage', axis=alt.Axis(title='Dog Stage')),
    alt.X('average(rating_numerator)', axis=alt.Axis(title='Average Rating'))
).properties(title = 'Best rated dogs are puppos')

#### What is the distribution of retweet counts?

In [221]:
base = alt.Chart(master_tweets)

bar = base.mark_bar(color='lightblue', opacity=0.9).encode(
    alt.X('favorite_count', axis=alt.Axis(title='Likes'),
          bin=alt.Bin(maxbins=100, extent=([0,50000]))),
    y='count()'
).properties(title = 'Followers engage a lot with almost every WeRateDogs tweet ')

rule = base.mark_rule(color='red').encode(
    x='mean(favorite_count):Q',
    size=alt.value(2)
)

bar + rule

In [222]:
base = alt.Chart(master_tweets)

bar = base.mark_bar(color='lightblue', opacity=0.9).encode(
    alt.X('retweet_count', axis=alt.Axis(title='Retweets'),
          bin=alt.Bin(maxbins=100, extent=([0,16000]))),
    y='count()'
).properties(title = 'Followers engage a lot with almost every WeRateDogs tweet ')

rule = base.mark_rule(color='red').encode(
    x='mean(retweet_count):Q',
    size=alt.value(2)
)

bar + rule

## Conclusion

From the wrangling efforts put into this project and analysis of the cleaned data we can see that:

* It can be seen that tweetID 744234799360020481 had the highest favorite count and retweet count
* Retweets and likes are strongly correlated
* Puppos was the dog stage whith the retweets and likes. It also has had the highest rating from people
* The WeRateDogs Twitter page had lots of interaction from Twitter users

    
    

## Credit

Pandas sort_values(): http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html

Sort values issue in SO: https://stackoverflow.com/questions/44123874/dataframe-object-has-no-attribute-sort

Reading and writing JSON to a file: https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/

SQL Alchemy: https://www.sqlalchemy.org/

Pillow Documentation: https://pillow.readthedocs.io/en/stable/

API tutorial of Mediawiki: https://www.mediawiki.org/wiki/API:Tutorial

Requests Documentation: http://docs.python-requests.org/en/master/user/intro/

Glob documentation: https://docs.python.org/3/library/glob.html

Python Files I/O: https://www.tutorialspoint.com/python/python_files_io.htm

Assessing a tweet with only its ID: https://www.bram.us/2017/11/22/accessing-a-tweet-using-only-its-id-and-without-the-twitter-api/

Reading and writing files in Python: https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python

Test for a pattern in a Pandas dataframe: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html

Tidy data: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

Imputation concepts for filling missing data: https://en.wikipedia.org/wiki/Imputation_(statistics)

Melt function in pandas: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.melt.html#pandas.DataFrame.melt

Commenting several lines of code in Jupyter Notebooks: https://stackoverflow.com/questions/29885371/how-do-i-comment-out-multiple-lines-in-jupyter-ipython-notebook

Extracting specific columns into a new dataframe: https://stackoverflow.com/questions/34682828/extracting-specific-selected-columns-to-new-dataframe-as-a-copy

How to iterate through a pandas series: https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas and https://stackoverflow.com/questions/43222878/iterate-over-pandas-dataframe-and-update-the-value-attributeerror-cant-set-a

Work with strings in pandas: https://pandas.pydata.org/pandas-docs/stable/text.htm

Merging datasets in pandas: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html

Pandas .isin() function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html

Using .loc and .iloc to slice in pandas: https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/