In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


# Project: Analysis on the WeRateDogs Twitter Archive

## Table of Contents
- Introduction
    - <a href = "#imports"> Notebook Imports </a>
    - <a href = "#constants"> Constants </a>
    - <a href = "#functions"> Function Definitions </a>
- Data Wrangling
    - <a href = "#gathering"> Data Gathering </a>
    - <a href = "#assessing"> Data Assessing </a>
    - <a href = "#cleaning"> Data Cleaning </a>
- <a href = "#analysis"> Data Analysis and Visualization </a>
- Conclusions


## Introduction

> WeRateDogs is a Twitter account that rates people's dogs and adds a humorous comment about each dog. People may want to buy dogs that are humorous or otherwise, so the goal of this project is to analyze the humour level of each breed of dog. Each dog is rated on its humour level with a common rating denominator of 10 but with _weird_ rating numerators, some higher than 10; that is all part of the fun of the `@dog_rates` Twitter account.

> Given the robustness of the data, only a minimum of eight (8) quality issues and two (2) tidiness issues needed to be identified and addressed.

> Likely questions to be asked include:
> - <a href="#q1">Do the ratings influence the likes and retweets?</a>
> - <a href="#q2">Is There a Relationship Between Budget and Revenue?</a>
> - <a href="#q3">Do the Cast Influence the Revenue?</a>
> - <a href="#q4">Do the Directors Influence the Revenue?</a>

<a id='imports'></a>
### Notebook Imports

In [2]:
import pandas as pd  # For Data Manipulation
import numpy as np  # For Array Manipulation

import os  # For File Access
import requests  # For Web Access

import tweepy  # For Accessing Twitter's API
import json # To manipulate JSON files

import matplotlib.pyplot as matpy  # For Data Visualization
import seaborn as sb  # For Data Visualization

<a id='constants'></a>
### Constants

In [3]:
IMAGE_URL = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
IMAGE_FILE = IMAGE_URL.split("/")[-1]
API_FILE = "twitter_json.txt"

API_KEY = input("Enter API key: ")
API_SECRET = input("Enter API secret: ")
ACCESS_TOKEN = input("Enter access token: ")
ACCESS_SECRET = input("Enter access secret: ")

Enter API key: 
Enter API secret: 
Enter access token: 
Enter access secret: 


<a id='functions'></a>
### Function Definitions

In [4]:
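def summary(df):
    """
    Prints the DataFrame info and the number of duplicated rows,
    then returns the count of missing values per column
    """
    df.info()  # info() prints directly and returns None
    isna = df.isna().sum()
    duplicates = df.duplicated().sum()  # Number of fully duplicated rows
    
    print(f"\n \n Duplicates = {duplicates}\n")
    return isna 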

In [5]:
def clean_test(df):
    """
    Prints the info of the cleaned data frame and returns a random sample of 5 rows
    """
    df.info()  # info() prints directly and returns None
    
    return df.sample(5)

## Data Wrangling

<a id='gathering'></a>
#### Data Gathering
1. Getting the data from a file.
2. Downloading the file programmatically and loading it.
3. Getting the data from Twitter's API.

In [6]:
# Reading files
tweets = pd.read_csv("twitter-archive-enhanced.csv")

In [7]:
# Downloading file
response = requests.get(IMAGE_URL)
response.raise_for_status()  # Stop early if the download failed

# os.getcwd() gets your current working directory
with open(os.path.join(os.getcwd(), IMAGE_FILE), "wb") as file:
    file.write(response.content)

In [6]:
# Checking the contents of the working directory
os.listdir(os.getcwd())

['.ipynb_checkpoints',
 'image-predictions.tsv',
 'twitter-api.py',
 'twitter-archive-enhanced.csv',
 'WeRateDogs Twitter Archive.ipynb']

In [7]:
# Reading the downloaded file
images = pd.read_csv(IMAGE_FILE, sep = "\t")

In [7]:
# Authenticating the API
auth = tweepy.OAuth1UserHandler(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth, wait_on_rate_limit = True)
api.verify_credentials()

User(_api=<tweepy.api.API object at 0x000001C492334850>, _json={'id': 1466314425569325058, 'id_str': '1466314425569325058', 'name': 'n.e.d.u.m', 'screen_name': 'n_e_d_u_m', 'location': '', 'description': 'Python Developer | Data Analyst', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 68, 'friends_count': 60, 'listed_count': 0, 'created_at': 'Thu Dec 02 07:55:26 +0000 2021', 'favourites_count': 371, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': False, 'statuses_count': 48, 'lang': None, 'status': {'created_at': 'Mon Jun 13 02:12:36 +0000 2022', 'id': 1536169635430682624, 'id_str': '1536169635430682624', 'text': 'Numpy, pandas matplotlib https://t.co/UFJkjkiaxG', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/UFJkjkiaxG', 'expanded_url': 'https://twitter.com/oliverjumpertz/status/1535941015437291521', 'display_url': 'twitter.com/oliverjumpertz…'

In [8]:
tweet_ids = tweets["tweet_id"].values
print(len(tweet_ids))
tweet_ids

2356


array([892420643555336193, 892177421306343426, 891815181378084864, ...,
       666033412701032449, 666029285002620928, 666020888022790149],
      dtype=int64)

In [9]:
%%time
# The %%time magic above reports how long the cell takes to run
failed = {}  # To store the failed retrievals
step = 0

# Opening the file in write mode
with open(API_FILE, "w") as file:
    # looping through each id
    for ids in tweet_ids:  
        step += 1
        if step % 5 == 0:
            print(f"{step}: {ids}")
        
        # Fetching the data
        try:
            tweet = api.get_status(ids, tweet_mode='extended')
            json.dump(tweet._json, file)
            file.write('\n')
            
        # Storing the errors
        except tweepy.errors.TweepyException as tte:
            failed[ids] = tte

print(failed)
len(failed)

5: 891327558926688256
10: 890240255349198849
15: 889531135344209921
20: 888202515573088257
25: 887343217045368832
30: 886366144734445568
35: 885528943205470208
40: 884876753390489601
45: 883838122936631299
50: 882762694511734784
55: 881666595344535552
60: 880872448815771648
65: 879674319642796034
70: 879050749262655488
75: 878316110768087041
80: 877556246731214848
85: 876484053909872640
90: 875021211251597312
95: 874012996292530176
100: 872967104147763200
105: 872261713294495745
110: 871166179821445120
115: 870656317836468226
120: 869772420881756160
125: 868639477480148993
130: 867421006826221569
135: 866686824827068416
140: 865359393868664832
145: 863907417377173506
150: 863079547188785154
155: 862096992088072192
160: 860981674716409858
165: 860184849394610176
170: 859196978902773760
175: 858107933456039936
180: 857214891891077121
185: 856526610513747968
190: 855860136149123072
195: 855245323840757760
200: 854120357044912130
205: 852936405516943360
210: 852226086759018497
215: 8514648

1690: 681340665377193984
1695: 681281657291280384
1700: 680970795137544192
1705: 680889648562991104
1710: 680609293079592961
1715: 680440374763077632
1720: 680161097740095489
1725: 680085611152338944
1730: 679862121895714818
1735: 679736210798047232
1740: 679511351870550016
1745: 679158373988876288
1750: 679047485189439488
1755: 678798276842360832
1760: 678740035362037760
1765: 678424312106393600
1770: 678380236862578688
1775: 678023323247357953
1780: 677716515794329600
1785: 677662372920729601
1790: 677547928504967168
1795: 677328882937298944
1800: 677187300187611136
1805: 676946864479084545
1810: 676864501615042560
1815: 676617503762681856
1820: 676590572941893632
1825: 676496375194980353
1830: 676237365392908289
1835: 676121918416756736
1840: 675891555769696257
1845: 675849018447167488
1850: 675781562965868544
1855: 675534494439489536
1860: 675497103322386432
1865: 675362609739206656
1870: 675153376133427200
1875: 675135153782571009
1880: 675015141583413248
1885: 674800520222154752


In [8]:
os.listdir(os.getcwd())

['.ipynb_checkpoints',
 'image-predictions.tsv',
 'twitter-api.py',
 'twitter-archive-enhanced.csv',
 'twitter_json.txt',
 'WeRateDogs Twitter Archive.ipynb']

In [8]:
# Creating a list to hold the parsed dictionaries
with open(API_FILE, "r") as file:
    api_list = [json.loads(line) for line in file]
api_list[0]

{'created_at': 'Tue Aug 01 16:23:56 +0000 2017',
 'id': 892420643555336193,
 'id_str': '892420643555336193',
 'full_text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
 'truncated': False,
 'display_text_range': [0, 85],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 892420639486877696,
    'id_str': '892420639486877696',
    'indices': [86, 109],
    'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'url': 'https://t.co/MgUWQ76dJU',
    'display_url': 'pic.twitter.com/MgUWQ76dJU',
    'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
    'type': 'photo',
    'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'medium': {'w': 540, 'h': 528, 'resize': 'fit'},
     'small': {'w': 540, 'h': 528, 'resize': 'fit'},
     'large': {'w': 

In [9]:
api_df = []  # List to hold all the dictionaries for the data frame
# Looping through the clean dictionaries
for dicts in api_list:
    # Gathering entities
    tweet_id = dicts["id"]
    likes = dicts["favorite_count"]
    retweets = dicts["retweet_count"]
    
    # Creating the data frame dictionary
    api_dict = {
        "tweet_id": tweet_id, 
        "likes": likes, 
        "retweets": retweets
    }

    api_df.append(api_dict)
        
api_df

[{'tweet_id': 892420643555336193, 'likes': 33823, 'retweets': 7010},
 {'tweet_id': 892177421306343426, 'likes': 29337, 'retweets': 5302},
 {'tweet_id': 891815181378084864, 'likes': 22069, 'retweets': 3482},
 {'tweet_id': 891689557279858688, 'likes': 36948, 'retweets': 7228},
 {'tweet_id': 891327558926688256, 'likes': 35321, 'retweets': 7764},
 {'tweet_id': 891087950875897856, 'likes': 17815, 'retweets': 2602},
 {'tweet_id': 890971913173991426, 'likes': 10368, 'retweets': 1666},
 {'tweet_id': 890729181411237888, 'likes': 56882, 'retweets': 15759},
 {'tweet_id': 890609185150312448, 'likes': 24531, 'retweets': 3624},
 {'tweet_id': 890240255349198849, 'likes': 27969, 'retweets': 6101},
 {'tweet_id': 890006608113172480, 'likes': 27047, 'retweets': 6154},
 {'tweet_id': 889880896479866881, 'likes': 24570, 'retweets': 4168},
 {'tweet_id': 889665388333682689, 'likes': 42073, 'retweets': 8353},
 {'tweet_id': 889638837579907072, 'likes': 23683, 'retweets': 3718},
 {'tweet_id': 889531135344209921,

In [10]:
likes = pd.DataFrame(api_df)
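
The loop that builds `api_df` can also be collapsed into a single constructor call. The following is a minimal sketch, assuming each parsed tweet dictionary carries the `id`, `favorite_count`, and `retweet_count` keys used above. The records in `sample_api_list` are made-up values rather than real API responses, and `build_engagement_frame` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical stand-in for api_list: extra keys are ignored by the column selection
sample_api_list = [
    {"id": 1, "favorite_count": 100, "retweet_count": 10, "lang": "en"},
    {"id": 2, "favorite_count": 250, "retweet_count": 40, "lang": "en"},
]

def build_engagement_frame(records):
    """Keep only the id and engagement counts, renamed to the notebook's column names."""
    frame = pd.DataFrame(records, columns=["id", "favorite_count", "retweet_count"])
    return frame.rename(
        columns={"id": "tweet_id", "favorite_count": "likes", "retweet_count": "retweets"}
    )

engagement = build_engagement_frame(sample_api_list)
print(engagement)
```

Passing `columns=` to the constructor selects just the keys of interest, so no intermediate list of dictionaries is needed.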

In [14]:
tweets.rating_numerator.head()

0    13
1    13
2    12
3    13
4    12
Name: rating_numerator, dtype: int64

In [17]:
# tweets["w"] = tweets["text"].str.extract("(\d+)")
# tweets["w"].head()
tweets.drop("w", axis = 1, inplace = True)

<a id='assessing'></a>
#### Data Assessing

In [12]:
tweets.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1369,702332542343577600,,,2016-02-24 03:21:41 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",This is Rudy. He's going to be a star. 13/10 t...,,,,https://vine.co/v/irlDujgwOjd,13,10,Rudy,,,,
2347,666057090499244032,,,2015-11-16 00:55:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",My oh my. This is a rare blond Canadian terrie...,,,,https://twitter.com/dog_rates/status/666057090...,9,10,a,,,,
444,819238181065359361,,,2017-01-11 17:42:57 +0000,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Some happy pupper news to share. 10/10 for eve...,,,,http://us.blastingnews.com/news/2017/01/200-do...,10,10,,,,pupper,
274,840698636975636481,8.406983e+17,8.405479e+17,2017-03-11 22:59:09 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@0_kelvin_0 &gt;10/10 is reserved for puppos s...,,,,,10,10,,,,,
2097,670733412878163972,,,2015-11-28 22:38:01 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jerry. He's a great listener. Low main...,,,,https://twitter.com/dog_rates/status/670733412...,8,10,Jerry,,,,


In [13]:
summary(tweets)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64

In [14]:
images.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
889,699088579889332224,https://pbs.twimg.com/media/CbOpWswWEAE9kvX.jpg,1,mousetrap,0.456186,False,banded_gecko,0.258677,False,common_iguana,0.061783,False
1848,839549326359670784,https://pbs.twimg.com/media/C6atpTLWYAIL7bU.jpg,1,swing,0.393527,False,Norwich_terrier,0.05248,True,Pembroke,0.049901,True
133,668480044826800133,https://pbs.twimg.com/media/CUbrDWOWcAEyMdM.jpg,1,Arctic_fox,0.119243,False,Labrador_retriever,0.099965,True,pug,0.086717,True
57,667073648344346624,https://pbs.twimg.com/media/CUHr8WbWEAEBPgf.jpg,1,Chihuahua,0.483682,True,pug,0.092494,True,Brabancon_griffon,0.057495,True
1078,717790033953034240,https://pbs.twimg.com/media/CfYaOeMWQAAGfyP.jpg,1,car_mirror,0.819106,False,minibus,0.107383,False,cab,0.034846,False


In [15]:
summary(images)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB

 
 Duplicates = 0



tweet_id    0
jpg_url     0
img_num     0
p1          0
p1_conf     0
p1_dog      0
p2          0
p2_conf     0
p2_dog      0
p3          0
p3_conf     0
p3_dog      0
dtype: int64

In [16]:
# value_counts is a method, so it has to be called to return the counts
images["img_num"].value_counts()

In [17]:
likes.sample(5)

Unnamed: 0,tweet_id,likes,retweets
1387,698703483621523456,1331,337
1483,691416866452082688,17897,6932
874,758474966123810816,3566,918
740,778286810187399168,9747,3064
1967,672591762242805761,846,286


In [18]:
summary(likes)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2327 entries, 0 to 2326
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   tweet_id  2327 non-null   int64
 1   likes     2327 non-null   int64
 2   retweets  2327 non-null   int64
dtypes: int64(3)
memory usage: 54.7 KB

 
 Duplicates = 0



tweet_id    0
likes       0
retweets    0
dtype: int64

##### Quality
`tweets`
- Retweets are included alongside the original tweets
- `source` and `rating_denominator` columns are not needed for the analysis
- `rating_numerator` column needs a clearer name
- Missing values for `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`
- Incorrect data types for `tweet_id`, `in_reply_to_status_id`, `in_reply_to_user_id`, `timestamp`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`

`images`
- Incorrect data type for `tweet_id`

`likes`
- Incorrect data type for `tweet_id`

##### Tidiness
`tweets`
- A single variable (the dog stage) is split across multiple columns (`doggo`, `floofer`, `pupper`, `puppo`)
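
One possible way to address the tidiness issue is to collapse the four stage columns into a single column. The sketch below is only an illustration under two assumptions: a stage column holds its own name when the stage applies (as in the sampled rows, where `pupper` appears in the `pupper` column), and any other value, such as the placeholder string `"None"`, means the stage is absent. `combine_stages` and the sample frame are hypothetical.

```python
import pandas as pd

STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]

def combine_stages(df, stage_columns=STAGE_COLUMNS):
    """Return a copy of df with the stage columns merged into one `stage` column."""
    def stage_of(row):
        # A stage is present when the cell repeats the column name
        present = [name for name in stage_columns if row[name] == name]
        return ", ".join(present) if present else None

    out = df.drop(columns=stage_columns)
    out["stage"] = df.apply(stage_of, axis=1)
    return out

# Hypothetical sample rows, not taken from the archive
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["None", "doggo", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
tidy = combine_stages(sample)
print(tidy)
```

Rows that mention more than one stage keep both names, joined by a comma, so no information is lost in the merge.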

<a id='cleaning'></a>
#### Data Cleaning

In [19]:
tweets_clean = tweets.copy()
images_clean = images.copy()
likes_clean = likes.copy()

##### _Define_
_**Fix missing values**: The missing values cannot be recovered, and these columns only hold identifiers that are meant to be unique, so I will drop the columns. `expanded_urls` is dropped along with them_
##### _Code_

In [20]:
tweets_clean.drop(["in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", 
                "retweeted_status_user_id", "retweeted_status_timestamp", "expanded_urls"], 
              axis = 1, inplace = True)

##### _Test_

In [21]:
clean_test(tweets_clean)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2356 non-null   int64 
 1   timestamp           2356 non-null   object
 2   source              2356 non-null   object
 3   text                2356 non-null   object
 4   rating_numerator    2356 non-null   int64 
 5   rating_denominator  2356 non-null   int64 
 6   name                2356 non-null   object
 7   doggo               2356 non-null   object
 8   floofer             2356 non-null   object
 9   pupper              2356 non-null   object
 10  puppo               2356 non-null   object
dtypes: int64(3), object(8)
memory usage: 202.6+ KB


Unnamed: 0,tweet_id,timestamp,source,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1207,715733265223708672,2016-04-01 02:51:22 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a taco. We only rate dogs. Please only...,10,10,a,,,,
1183,718939241951195136,2016-04-09 23:10:47 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Otis. Everybody look at Otis. 12/10 wo...,12,10,Otis,,,,
990,748705597323898880,2016-07-01 02:31:39 +0000,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",#BarkWeek is getting rather heckin terrifying ...,13,10,,,,,
395,825535076884762624,2017-01-29 02:44:34 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a very loving and accepting puppo. Appe...,14,10,,,,,puppo
187,856282028240666624,2017-04-23 23:01:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...","This is Cermet, Paesh, and Morple. They are ab...",14,10,Cermet,,,,


##### _Define_
_**Remove all retweets**: Removing every row where the `text` starts with "[RT](https://stackoverflow.com/a/46693056/18102026)" and removing the corresponding `likes` and `retweets` records_
##### _Code_

In [22]:
id_val = tweets_clean[tweets_clean.text.str.startswith("RT")].tweet_id.values

In [23]:
tweets_clean = tweets_clean[~tweets_clean.text.str.startswith("RT")]

In [24]:
count = 0
for ids in id_val:
    try:
        loc = likes_clean.loc[likes_clean.tweet_id == ids].index[0]
        likes_clean.drop(index = loc, inplace = True)
    except IndexError:
        count += 1
count

21
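
The same removal can be written without an explicit loop. This is a minimal sketch using a boolean mask from `isin`, assuming `tweet_id` is the matching key. The small frame and the id list below are hypothetical stand-ins for `likes_clean` and `id_val`.

```python
import pandas as pd

# Hypothetical stand-ins for likes_clean and id_val
likes_sample = pd.DataFrame({"tweet_id": [10, 11, 12, 13], "likes": [5, 6, 7, 8]})
retweet_ids = [11, 13, 99]  # 99 has no row in likes_sample

# Keep only the rows whose tweet_id is not a retweet id
is_retweet = likes_sample["tweet_id"].isin(retweet_ids)
likes_filtered = likes_sample[~is_retweet]

# Count the retweet ids that had no matching row, like `count` in the loop
missing = len(set(retweet_ids) - set(likes_sample["tweet_id"]))

print(likes_filtered)
print(missing)
```

The mask is computed once for the whole column, which avoids looking up and dropping one index at a time.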

##### _Test_

In [25]:
tweets_clean[tweets_clean.text.str.startswith("RT")]

Unnamed: 0,tweet_id,timestamp,source,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [26]:
clean_test(likes_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2165 entries, 0 to 2326
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   tweet_id  2165 non-null   int64
 1   likes     2165 non-null   int64
 2   retweets  2165 non-null   int64
dtypes: int64(3)
memory usage: 67.7 KB


Unnamed: 0,tweet_id,likes,retweets
1093,730924654643314689,5658,1792
1434,694905863685980160,2543,841
1275,707420581654872064,2077,636
1354,700864154249383937,2405,561
389,823699002998870016,13221,2336


##### _Define_
_**Dropping the `source` and `rating_denominator` columns**: I don't think the `source` column will be important in my analysis, and the rating denominator is the same for all dogs_
##### _Code_

In [27]:
tweets_clean.drop(["source", "rating_denominator"], axis = 1, inplace = True)
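
The decision to drop `rating_denominator` rests on the denominator being constant. A quick check along the following lines could confirm that before the column is removed. It is a sketch on hypothetical values, not the archive itself, and the expected denominator of 10 is taken from the introduction.

```python
import pandas as pd

# Hypothetical ratings, including one row whose denominator is not 10
ratings_sample = pd.DataFrame({
    "rating_numerator": [13, 12, 84, 9],
    "rating_denominator": [10, 10, 70, 10],
})

# How often each denominator occurs
denominator_counts = ratings_sample["rating_denominator"].value_counts()

# Rows that break the "always out of 10" assumption
non_standard = ratings_sample[ratings_sample["rating_denominator"] != 10]

print(denominator_counts)
print(non_standard)
```

If `non_standard` turns out to be empty on the real data, dropping the column loses nothing; otherwise those rows may deserve a closer look first.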

##### _Test_

In [28]:
clean_test(tweets_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2173 entries, 0 to 2355
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   tweet_id          2173 non-null   int64 
 1   timestamp         2173 non-null   object
 2   text              2173 non-null   object
 3   rating_numerator  2173 non-null   int64 
 4   name              2173 non-null   object
 5   doggo             2173 non-null   object
 6   floofer           2173 non-null   object
 7   pupper            2173 non-null   object
 8   puppo             2173 non-null   object
dtypes: int64(2), object(7)
memory usage: 169.8+ KB


Unnamed: 0,tweet_id,timestamp,text,rating_numerator,name,doggo,floofer,pupper,puppo
466,817171292965273600,2017-01-06 00:49:53 +0000,This is Tebow. He kindly requests that you put...,13,Tebow,,,,
173,858471635011153920,2017-04-30 00:02:42 +0000,This is Sophie. She just arrived. Used paworit...,13,Sophie,,,,
928,754747087846248448,2016-07-17 18:38:22 +0000,This is Keith. He's pursuing a more 2D lifesty...,12,Keith,,,,
1744,679158373988876288,2015-12-22 04:35:49 +0000,This is Rubio. He has too much skin. 11/10 htt...,11,Rubio,,,,
123,868880397819494401,2017-05-28 17:23:24 +0000,This is Walter. He won't start hydrotherapy wi...,14,Walter,,,,


##### _Define_
_**Renaming the `rating_numerator` column**: I'll change it to just `ratings` since the rating_denominator has been removed_
##### _Code_

In [29]:
tweets_clean.rename({"rating_numerator": "ratings"}, axis = 1, inplace = True)

##### _Test_

In [30]:
clean_test(tweets_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2173 entries, 0 to 2355
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   tweet_id   2173 non-null   int64 
 1   timestamp  2173 non-null   object
 2   text       2173 non-null   object
 3   ratings    2173 non-null   int64 
 4   name       2173 non-null   object
 5   doggo      2173 non-null   object
 6   floofer    2173 non-null   object
 7   pupper     2173 non-null   object
 8   puppo      2173 non-null   object
dtypes: int64(2), object(7)
memory usage: 169.8+ KB


Unnamed: 0,tweet_id,timestamp,text,ratings,name,doggo,floofer,pupper,puppo
884,760190180481531904,2016-08-01 19:07:17 +0000,Meet Sadie. She's addicted to balloons. It's t...,10,Sadie,,,,
1881,675003128568291329,2015-12-10 17:24:21 +0000,Meet Laela. She's adorable. Magnificent eyes. ...,12,Laela,,,,
1079,739238157791694849,2016-06-04 23:31:25 +0000,Here's a doggo blowing bubbles. It's downright...,13,,doggo,,,
1486,693109034023534592,2016-01-29 16:30:45 +0000,"""Thank you friend that was a swell petting"" 11...",11,,,,,
278,840268004936019968,2017-03-10 18:27:58 +0000,This is Monty. He makes instantly regrettable ...,12,Monty,,,,


After removing the retweets, the number of rows dropped from 2356 to 2173. Most of the columns with incorrect data types have already been removed, which leaves only `tweet_id` and `timestamp` to convert.

##### _Define_
_**Changing data type**: Changing the data type of `tweet_id` in all three data frames because it is an identifier and is not going to be used for calculations. Also converting the `timestamp` column to datetime_
##### _Code_

In [31]:
tweets_clean.tweet_id = tweets_clean.tweet_id.astype(object)

In [32]:
tweets_clean.timestamp = pd.to_datetime(tweets_clean.timestamp)

In [33]:
images_clean.tweet_id = images_clean.tweet_id.astype(object)

In [34]:
likes_clean.tweet_id = likes_clean.tweet_id.astype(object)
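
As a small illustration of these conversions, the sketch below converts an id column to strings and parses a timestamp column on a two-row frame. Note that it uses `astype(str)`, which is a swapped-in alternative to the `astype(object)` used above: `astype(object)` keeps the underlying integers, while `astype(str)` stores the ids as text. The first row reuses the id and creation time shown in the API output earlier; the second row is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({
    "tweet_id": [892420643555336193, 123456789012345678],  # second id is made up
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:00:00 +0000"],
})

converted = sample.assign(
    tweet_id=sample["tweet_id"].astype(str),         # ids become text, not numbers
    timestamp=pd.to_datetime(sample["timestamp"]),   # timezone-aware datetimes
)

print(converted.dtypes)
```

Whichever representation is chosen, it should be the same in every data frame that is later merged on `tweet_id`.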

##### _Test_

In [35]:
clean_test(tweets_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2173 entries, 0 to 2355
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype              
---  ------     --------------  -----              
 0   tweet_id   2173 non-null   object             
 1   timestamp  2173 non-null   datetime64[ns, UTC]
 2   text       2173 non-null   object             
 3   ratings    2173 non-null   int64              
 4   name       2173 non-null   object             
 5   doggo      2173 non-null   object             
 6   floofer    2173 non-null   object             
 7   pupper     2173 non-null   object             
 8   puppo      2173 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(1), object(7)
memory usage: 169.8+ KB


Unnamed: 0,tweet_id,timestamp,text,ratings,name,doggo,floofer,pupper,puppo
578,800751577355128832,2016-11-21 17:23:47+00:00,Say hello to Mauve and Murphy. They're rather ...,12,Mauve,,,,
624,795464331001561088,2016-11-07 03:14:10+00:00,Elder doggo does a splash. Both 13/10 incredib...,13,,doggo,,,
1656,683357973142474752,2016-01-02 18:43:31+00:00,"""Have a seat, son. There are some things we ne...",10,,,,,
1800,676975532580409345,2015-12-16 04:01:59+00:00,This is Cooper. He doesn't know how cheese wor...,11,Cooper,,,,
1507,691675652215414786,2016-01-25 17:35:00+00:00,This is Richie and Plip. They are the best of ...,10,Richie,,,,


In [36]:
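# NOTE: this inspects the original `images` frame, so `tweet_id` still shows int64 below;
# pass `images_clean` to verify the conversion
clean_test(images)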

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
120,668190681446379520,https://pbs.twimg.com/media/CUXj4SgXAAETlu6.jpg,1,Blenheim_spaniel,0.958402,True,cocker_spaniel,0.026764,True,Welsh_springer_spaniel,0.00779,True
493,675707330206547968,https://pbs.twimg.com/media/CWCYOqWUAAARmGr.jpg,1,bath_towel,0.721933,False,Staffordshire_bullterrier,0.059344,True,bagel,0.035702,False
1004,708845821941387268,https://pbs.twimg.com/media/CdZTgynWwAATZcx.jpg,1,schipperke,0.74564,True,kelpie,0.167853,True,Boston_bull,0.014763,True
1090,719339463458033665,https://pbs.twimg.com/media/Cfuba6NW4AIeMHk.jpg,1,golden_retriever,0.765778,True,borzoi,0.071148,True,Leonberg,0.070371,True
1716,819006400881917954,https://pbs.twimg.com/media/C12x-JTVIAAzdfl.jpg,4,prison,0.907083,False,palace,0.020089,False,umbrella,0.00785,False


In [37]:
clean_test(likes_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2165 entries, 0 to 2326
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   tweet_id  2165 non-null   object
 1   likes     2165 non-null   int64 
 2   retweets  2165 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 67.7+ KB


Unnamed: 0,tweet_id,likes,retweets
2283,666776908487630848,305,150
145,862831371563274240,17226,4461
1014,743895849529389061,3410,881
925,751830394383790080,5448,1791
221,847962785489326080,21763,4588


##### _Define_
_**Merging**: Merging the tweets dataframe with the likes and retweets dataframe_
##### _Code_

In [38]:
# Checking whether any tweet_id in `tweets` is absent from `likes`
(~tweets.tweet_id.isin(likes.tweet_id)).any()

True

This implies that there are tweets in the `tweets` dataframe that are not in the `likes` dataframe, as the row counts reported by `.info()` for the two dataframes (2356 versus 2327) already suggested.

In [39]:
data = tweets_clean.merge(likes_clean, on = "tweet_id")
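
To see which rows an inner merge like this leaves out, a left merge with `indicator=True` can be used as a check. The sketch below runs on two small hypothetical frames; `validate="one_to_one"` additionally raises an error if a `tweet_id` is duplicated on either side.

```python
import pandas as pd

# Hypothetical stand-ins for tweets_clean and likes_clean
left = pd.DataFrame({"tweet_id": ["1", "2", "3"], "ratings": [12, 13, 10]})
right = pd.DataFrame({"tweet_id": ["1", "3"], "likes": [100, 300], "retweets": [10, 30]})

checked = left.merge(
    right, on="tweet_id", how="left", indicator=True, validate="one_to_one"
)

# Tweets with no engagement record
unmatched = checked.loc[checked["_merge"] == "left_only", "tweet_id"].tolist()

# Equivalent of the inner merge result
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(unmatched)
print(matched)
```

The `_merge` column makes the dropped rows explicit instead of leaving them to be inferred from the row counts.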

##### _Test_

In [40]:
clean_test(data)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2165 entries, 0 to 2164
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype              
---  ------     --------------  -----              
 0   tweet_id   2165 non-null   object             
 1   timestamp  2165 non-null   datetime64[ns, UTC]
 2   text       2165 non-null   object             
 3   ratings    2165 non-null   int64              
 4   name       2165 non-null   object             
 5   doggo      2165 non-null   object             
 6   floofer    2165 non-null   object             
 7   pupper     2165 non-null   object             
 8   puppo      2165 non-null   object             
 9   likes      2165 non-null   int64              
 10  retweets   2165 non-null   int64              
dtypes: datetime64[ns, UTC](1), int64(3), object(7)
memory usage: 203.0+ KB


Unnamed: 0,tweet_id,timestamp,text,ratings,name,doggo,floofer,pupper,puppo,likes,retweets
1392,687312378585812992,2016-01-13 16:36:55+00:00,"""You got any games on your phone"" 7/10 for inv...",7,,,,pupper,,4020,1737
660,770069151037685760,2016-08-29 01:22:47+00:00,Say hello to Carbon. This is his first time sw...,10,Carbon,,,,,7154,2127
777,751251247299190784,2016-07-08 03:07:09+00:00,This is Gilbert. He's being chased by a battal...,10,Gilbert,,,,,11750,5398
326,826204788643753985,2017-01-30 23:05:46+00:00,"Here's a little more info on Dew, your favorit...",13,,doggo,,,,4602,856
148,858843525470990336,2017-05-01 00:40:27+00:00,I have stumbled puppon a doggo painting party....,13,,doggo,,,,14134,3029


##### _Define_
_**Dropping the `text` column**: I figured there would be no need for the `text` column because the name of the dog already has its own column, as do the ratings_
##### _Code_

In [41]:
data.drop("text", axis = 1, inplace = True)

##### _Test_

In [42]:
clean_test(data)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2165 entries, 0 to 2164
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype              
---  ------     --------------  -----              
 0   tweet_id   2165 non-null   object             
 1   timestamp  2165 non-null   datetime64[ns, UTC]
 2   ratings    2165 non-null   int64              
 3   name       2165 non-null   object             
 4   doggo      2165 non-null   object             
 5   floofer    2165 non-null   object             
 6   pupper     2165 non-null   object             
 7   puppo      2165 non-null   object             
 8   likes      2165 non-null   int64              
 9   retweets   2165 non-null   int64              
dtypes: datetime64[ns, UTC](1), int64(3), object(6)
memory usage: 186.1+ KB


Unnamed: 0,tweet_id,timestamp,ratings,name,doggo,floofer,pupper,puppo,likes,retweets
1509,681242418453299201,2015-12-27 22:37:04+00:00,10,Champ,,,,,1361,470
1024,715696743237730304,2016-04-01 00:26:15+00:00,10,Toby,,,,,3622,1171
996,719332531645071360,2016-04-11 01:13:34+00:00,11,Pippa,,,,,3185,861
1846,671735591348891648,2015-12-01 17:00:19+00:00,11,Oscar,,,,,1289,657
1667,675531475945709568,2015-12-12 04:23:49+00:00,10,Ellie,,,,,1069,332


In [46]:
data.sample(10)

Unnamed: 0,tweet_id,timestamp,ratings,name,doggo,floofer,pupper,puppo,likes,retweets
1209,699788877217865730,2016-02-17 02:54:04+00:00,12,Nala,,,,,2066,470
1152,704871453724954624,2016-03-02 03:30:25+00:00,12,,,,,,3970,1019
1181,702539513671897089,2016-02-24 17:04:07+00:00,12,a,,,,,2655,875
581,782747134529531904,2016-10-03 01:00:34+00:00,11,Deacon,doggo,,,,7077,1319
1896,670804601705242624,2015-11-29 03:20:54+00:00,10,Mason,,,,,1743,822
350,821886076407029760,2017-01-19 01:04:45+00:00,13,Jimison,,,,,10683,2142
552,787322443945877504,2016-10-15 16:01:13+00:00,10,Lincoln,,,,,7406,1628
216,844580511645339650,2017-03-22 16:04:20+00:00,11,Charlie,,,,,15310,2822
576,783466772167098368,2016-10-05 00:40:09+00:00,11,Finn,,,,,7989,2077
874,741743634094141440,2016-06-11 21:27:17+00:00,11,Aqua,,,pupper,,7558,2505


In [44]:
images_clean.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1243,747219827526344708,https://pbs.twimg.com/media/Cl6odlVWQAIy5uk.jpg,2,Shetland_sheepdog,0.548018,True,marmot,0.165503,False,collie,0.043003,True
752,688116655151435777,https://pbs.twimg.com/media/CYyucekVAAESj8K.jpg,1,pug,0.973819,True,Chihuahua,0.010891,True,Staffordshire_bullterrier,0.006864,True
1044,712717840512598017,https://pbs.twimg.com/media/CeQVF1eVIAAJaTv.jpg,1,Great_Pyrenees,0.732043,True,kuvasz,0.121375,True,Irish_wolfhound,0.049524,True
1098,720389942216527872,https://pbs.twimg.com/media/Cf9W1J-UMAErahM.jpg,1,Pembroke,0.873977,True,Cardigan,0.043339,True,Eskimo_dog,0.019197,True
696,684481074559381504,https://pbs.twimg.com/media/CX_D6AJWwAAnBIw.jpg,1,Chihuahua,0.93781,True,Pomeranian,0.020307,True,polecat,0.017357,False
215,670055038660800512,https://pbs.twimg.com/media/CUyDgChWUAAmNSI.jpg,1,snail,0.563631,False,slug,0.296649,False,bolete,0.031839,False
114,667924896115245057,https://pbs.twimg.com/media/CUTyJpHWcAATl0O.jpg,1,Labrador_retriever,0.209051,True,hog,0.20398,False,Newfoundland,0.165914,True
1235,746507379341139972,https://pbs.twimg.com/media/Clwgf4bWgAAB15c.jpg,1,toy_poodle,0.508292,True,Lakeland_terrier,0.234458,True,affenpinscher,0.084563,True
1828,835574547218894849,https://pbs.twimg.com/media/C5iOnigWcAAU3Ry.jpg,1,Staffordshire_bullterrier,0.610655,True,muzzle,0.132138,False,American_Staffordshire_terrier,0.109544,True
1195,740365076218183684,https://pbs.twimg.com/media/CkZOGhJWsAAHvPv.jpg,1,bow_tie,0.246313,False,Windsor_tie,0.172446,False,mushroom,0.137516,False


<a id='analysis'></a>
## Data Analysis and Visualization

<a id='q1'></a>
#### Do the ratings influence the likes and retweets?

In [45]:
data.ratings.value_counts()

12      497
10      439
11      426
13      303
9       156
8        98
7        54
14       43
5        36
6        32
3        19
4        17
2         9
1         8
420       2
0         2
17        1
45        1
60        1
143       1
44        1
99        1
121       1
20        1
26        1
144       1
80        1
165       1
50        1
204       1
1776      1
27        1
75        1
24        1
84        1
960       1
15        1
182       1
666       1
88        1
Name: ratings, dtype: int64
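
The counts above show a handful of extreme ratings (for example 420 and 1776) that each occur only once or twice. One way to approach the question is to set those aside and look at the correlation between ratings, likes, and retweets. The following is a sketch on made-up numbers; the cutoff `MAX_RATING = 14` is an assumption based on the counts above, not a rule from the data.

```python
import pandas as pd

# Hypothetical values, with one extreme rating included on purpose
sample = pd.DataFrame({
    "ratings": [10, 11, 12, 13, 14, 1776],
    "likes": [1000, 2000, 4000, 8000, 16000, 5000],
    "retweets": [300, 500, 900, 2000, 4000, 2500],
})

MAX_RATING = 14  # Assumed cutoff for "typical" ratings

# Exclude the extreme ratings before measuring the relationship
typical = sample[sample["ratings"] <= MAX_RATING]

# Pairwise Pearson correlations between the three columns
correlations = typical[["ratings", "likes", "retweets"]].corr()

print(correlations)
```

A correlation close to 1 would suggest that higher ratings go together with more likes and retweets, although it would not show that the rating causes them.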