In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


# Project: Analysis on the WeRateDogs Twitter Archive

## Table of Contents
- Introduction
    - <a href = "#imports"> Notebook Imports </a>
    - <a href = "#constants"> Constants </a>
    - <a href = "#functions"> Function Definitions </a>
- Data Wrangling
    - <a href = "#gathering"> Data Gathering </a>
    - <a href = "#assessing"> Data Assessing </a>
    - <a href = "#cleaning"> Data Cleaning </a>
- <a href = "#analysis"> Data Analysis and Visualization </a>
- Conclusions


## Introduction

> WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. People may want to buy dogs that are humorous or otherwise. The goal of the project is to analyze the humour level of each breed of dog. Each dog is rated on its humour level with a common rating denominator of 10 but with _weird_ rating numerators, some higher than 10; that is all part of the fun of the `@dog_rates` Twitter account. According to the project description, some ratings in `twitter-archive-enhanced.csv` were not accurately pulled from the `text`.

> Because of the robustness of the data, only a minimum of eight (8) quality issues and two (2) tidiness issues needed to be identified and addressed.

> Likely questions to be asked include:
> - <a href="#q1">Do the ratings influence the likes and retweets?</a>
> - <a href="#q2">Is There a Relationship Between Budget and Revenue?</a>
> - <a href="#q3">Do the Cast Influence the Revenue?</a>
> - <a href="#q4">Do the Directors Influence the Revenue?</a>

<a id='imports'></a>
### Notebook Imports

In [2]:
import pandas as pd  # For Data Manipulation
import numpy as np  # For Array Manipulation

import os  # For File Access
import requests  # For Web Access

import tweepy  # For Accessing Twitter's API
import json # To manipulate JSON files

import re  # For Regular Expressions

import matplotlib.pyplot as matpy  # For Data Visualization
import seaborn as sb  # For Data Visualization

<a id='constants'></a>
### Constants

In [3]:
IMAGE_URL = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
IMAGE_FILE = IMAGE_URL.split("/")[-1]
API_FILE = "twitter_json.txt"

API_KEY = input("Enter API key: ")
API_SECRET = input("Enter API secret: ")
ACCESS_TOKEN = input("Enter access token: ")
ACCESS_SECRET = input("Enter access secret: ")

Enter API key: 
Enter API secret: 
Enter access token: 
Enter access secret: 
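
A minimal sketch of an alternative way to collect the credentials, assuming they may already be stored in environment variables. The helper name `read_secret` and the variable name `TWITTER_API_KEY` are hypothetical and not part of the original notebook; `getpass` is used as the fallback so that typed secrets are not echoed into the cell output.

```python
import os
from getpass import getpass


def read_secret(env_name, prompt, environ=os.environ):
    """Return the value stored under env_name, or prompt for it without echo."""
    value = environ.get(env_name)
    if value:
        return value
    # Fallback: ask interactively; getpass hides what is typed
    return getpass(prompt)


# Demo with an explicit mapping so that no prompt is shown
# ("TWITTER_API_KEY" is a hypothetical variable name)
demo_key = read_secret(
    "TWITTER_API_KEY", "Enter API key: ", environ={"TWITTER_API_KEY": "demo-value"}
)
print(demo_key)
```
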


<a id='functions'></a>
### Function Definitions

In [4]:
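def summary(df):
    """
    Prints the DataFrame info and the number of duplicated rows,
    then returns the count of missing values per column
    """
    df.info()  # info() prints directly and returns None
    isna = df.isna().sum()
    duplicates = df.duplicated().sum()  # Number of fully duplicated rows
    
    print(f"\n \n Duplicates = {duplicates}\n")
    return isna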
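
A small sketch of how `DataFrame.duplicated()` can be turned into a duplicate count. `.any()` only answers whether at least one duplicated row exists, while `.sum()` counts the duplicated rows. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy frame: rows 1 and 2 repeat row 0
toy = pd.DataFrame({"a": [1, 1, 1, 2], "b": ["x", "x", "x", "y"]})

# True if at least one row is a duplicate of an earlier row
has_duplicates = bool(toy.duplicated().any())

# Number of rows flagged as duplicates (the first occurrence is not flagged)
duplicate_count = int(toy.duplicated().sum())

print(has_duplicates, duplicate_count)
```
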

In [5]:
def clean_test(df):
    """
    Prints the info of the cleaned data frame and returns a random sample of 5 rows
    """
    df.info()  # info() prints directly and returns None
    sample = df.sample(5)
    
    return sample

## Data Wrangling

<a id='gathering'></a>
#### Data Gathering
1. Getting the data from a file.
2. Downloading the file programmatically and loading it.
3. Getting the data from Twitter's API.

In [6]:
# Reading files
tweets = pd.read_csv("twitter-archive-enhanced.csv")

In [7]:
# Downloading file
response = requests.get(IMAGE_URL)
response.raise_for_status()  # Stop here if the download failed

# os.getcwd() gets your current working directory
with open(os.path.join(os.getcwd(), IMAGE_FILE), "wb") as file:
    file.write(response.content)

In [7]:
# Checking the contents of the working directory
os.listdir(os.getcwd())

['.git',
 '.ipynb_checkpoints',
 'image-predictions.tsv',
 'README.md',
 'twitter-archive-enhanced.csv',
 'twitter_json.txt',
 'WeRateDogs Twitter Archive.ipynb']

In [8]:
# Reading the downloaded file
images = pd.read_csv(IMAGE_FILE, sep = "\t")

In [7]:
# Authenticating the API
auth = tweepy.OAuth1UserHandler(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth, wait_on_rate_limit = True)
api.verify_credentials()

User(_api=<tweepy.api.API object at 0x000001C492334850>, _json={'id': 1466314425569325058, 'id_str': '1466314425569325058', 'name': 'n.e.d.u.m', 'screen_name': 'n_e_d_u_m', 'location': '', 'description': 'Python Developer | Data Analyst', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 68, 'friends_count': 60, 'listed_count': 0, 'created_at': 'Thu Dec 02 07:55:26 +0000 2021', 'favourites_count': 371, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': False, 'statuses_count': 48, 'lang': None, 'status': {'created_at': 'Mon Jun 13 02:12:36 +0000 2022', 'id': 1536169635430682624, 'id_str': '1536169635430682624', 'text': 'Numpy, pandas matplotlib https://t.co/UFJkjkiaxG', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/UFJkjkiaxG', 'expanded_url': 'https://twitter.com/oliverjumpertz/status/1535941015437291521', 'display_url': 'twitter.com/oliverjumpertz…'

In [8]:
tweet_ids = tweets["tweet_id"].values
print(len(tweet_ids))
tweet_ids

2356


array([892420643555336193, 892177421306343426, 891815181378084864, ...,
       666033412701032449, 666029285002620928, 666020888022790149],
      dtype=int64)

In [9]:
%%time  # To calculate the time it will take to run the code 
failed = {}  # To store the failed retrievals
step = 0

# Opening the file in write mode
with open(API_FILE, "w") as file:
    # looping through each id
    for ids in tweet_ids:  
        step += 1
        if step % 5 == 0:
            print(f"{step}: {ids}")
        
        # Fetching the data
        try:
            tweet = api.get_status(ids, tweet_mode='extended')
            json.dump(tweet._json, file)
            file.write('\n')
            
        # Storing the errors
        except tweepy.errors.TweepyException as tte:
            failed[ids] = tte

print(failed)
len(failed)

5: 891327558926688256
10: 890240255349198849
15: 889531135344209921
20: 888202515573088257
25: 887343217045368832
30: 886366144734445568
35: 885528943205470208
40: 884876753390489601
45: 883838122936631299
50: 882762694511734784
55: 881666595344535552
60: 880872448815771648
65: 879674319642796034
70: 879050749262655488
75: 878316110768087041
80: 877556246731214848
85: 876484053909872640
90: 875021211251597312
95: 874012996292530176
100: 872967104147763200
105: 872261713294495745
110: 871166179821445120
115: 870656317836468226
120: 869772420881756160
125: 868639477480148993
130: 867421006826221569
135: 866686824827068416
140: 865359393868664832
145: 863907417377173506
150: 863079547188785154
155: 862096992088072192
160: 860981674716409858
165: 860184849394610176
170: 859196978902773760
175: 858107933456039936
180: 857214891891077121
185: 856526610513747968
190: 855860136149123072
195: 855245323840757760
200: 854120357044912130
205: 852936405516943360
210: 852226086759018497
215: 8514648

1690: 681340665377193984
1695: 681281657291280384
1700: 680970795137544192
1705: 680889648562991104
1710: 680609293079592961
1715: 680440374763077632
1720: 680161097740095489
1725: 680085611152338944
1730: 679862121895714818
1735: 679736210798047232
1740: 679511351870550016
1745: 679158373988876288
1750: 679047485189439488
1755: 678798276842360832
1760: 678740035362037760
1765: 678424312106393600
1770: 678380236862578688
1775: 678023323247357953
1780: 677716515794329600
1785: 677662372920729601
1790: 677547928504967168
1795: 677328882937298944
1800: 677187300187611136
1805: 676946864479084545
1810: 676864501615042560
1815: 676617503762681856
1820: 676590572941893632
1825: 676496375194980353
1830: 676237365392908289
1835: 676121918416756736
1840: 675891555769696257
1845: 675849018447167488
1850: 675781562965868544
1855: 675534494439489536
1860: 675497103322386432
1865: 675362609739206656
1870: 675153376133427200
1875: 675135153782571009
1880: 675015141583413248
1885: 674800520222154752


In [9]:
os.listdir(os.getcwd())

['.git',
 '.ipynb_checkpoints',
 'image-predictions.tsv',
 'README.md',
 'twitter-archive-enhanced.csv',
 'twitter_json.txt',
 'WeRateDogs Twitter Archive.ipynb']

In [10]:
# Creating a list to hold the parsed dictionaries, one per line of the file
with open(API_FILE, "r") as file:
    api_list = [json.loads(line) for line in file]
api_list[0]

{'created_at': 'Tue Aug 01 16:23:56 +0000 2017',
 'id': 892420643555336193,
 'id_str': '892420643555336193',
 'full_text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
 'truncated': False,
 'display_text_range': [0, 85],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 892420639486877696,
    'id_str': '892420639486877696',
    'indices': [86, 109],
    'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'url': 'https://t.co/MgUWQ76dJU',
    'display_url': 'pic.twitter.com/MgUWQ76dJU',
    'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
    'type': 'photo',
    'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'medium': {'w': 540, 'h': 528, 'resize': 'fit'},
     'small': {'w': 540, 'h': 528, 'resize': 'fit'},
     'large': {'w': 

In [11]:
api_df = []  # List to hold all the dictionaries for the data frame
# Looping through the clean dictionaries
for dicts in api_list:
    # Gathering entities
    tweet_id = dicts["id"]
    likes = dicts["favorite_count"]
    retweets = dicts["retweet_count"]
    
    # Creating the data frame dictionary
    api_dict = {
        "tweet_id": tweet_id, 
        "likes": likes, 
        "retweets": retweets
    }

    api_df.append(api_dict)
        
api_df

[{'tweet_id': 892420643555336193, 'likes': 33823, 'retweets': 7010},
 {'tweet_id': 892177421306343426, 'likes': 29337, 'retweets': 5302},
 {'tweet_id': 891815181378084864, 'likes': 22069, 'retweets': 3482},
 {'tweet_id': 891689557279858688, 'likes': 36948, 'retweets': 7228},
 {'tweet_id': 891327558926688256, 'likes': 35321, 'retweets': 7764},
 {'tweet_id': 891087950875897856, 'likes': 17815, 'retweets': 2602},
 {'tweet_id': 890971913173991426, 'likes': 10368, 'retweets': 1666},
 {'tweet_id': 890729181411237888, 'likes': 56882, 'retweets': 15759},
 {'tweet_id': 890609185150312448, 'likes': 24531, 'retweets': 3624},
 {'tweet_id': 890240255349198849, 'likes': 27969, 'retweets': 6101},
 {'tweet_id': 890006608113172480, 'likes': 27047, 'retweets': 6154},
 {'tweet_id': 889880896479866881, 'likes': 24570, 'retweets': 4168},
 {'tweet_id': 889665388333682689, 'likes': 42073, 'retweets': 8353},
 {'tweet_id': 889638837579907072, 'likes': 23683, 'retweets': 3718},
 {'tweet_id': 889531135344209921,

In [12]:
likes = pd.DataFrame(api_df)
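
The steps above (parse each line as JSON, keep three fields, build a data frame) can be sketched as a single helper. This is a minimal sketch, not the notebook's own code: the function name `load_tweet_counts` is hypothetical, and the two sample records are made up so the example runs without the API file.

```python
import io
import json

import pandas as pd


def load_tweet_counts(lines):
    """Build a tweet_id / likes / retweets frame from newline-delimited JSON."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "likes": tweet["favorite_count"],
            "retweets": tweet["retweet_count"],
        })
    return pd.DataFrame(records, columns=["tweet_id", "likes", "retweets"])


# Made-up records standing in for the contents of the API file
sample = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2}\n'
    '{"id": 2, "favorite_count": 5, "retweet_count": 1}\n'
)
demo_counts = load_tweet_counts(sample)
print(demo_counts)
```

With a real file, the same helper could be called inside a `with open(API_FILE) as file:` block.
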

<a id='assessing'></a>
#### Data Assessing

In [13]:
tweets.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2225,668286279830867968,,,2015-11-22 04:33:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Rusty. Rusty's dreaming of a world where ...,,,,https://twitter.com/dog_rates/status/668286279...,11,10,Rusty,,,,
1193,717537687239008257,,,2016-04-06 02:21:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",People please. This is a Deadly Mediterranean ...,,,,https://twitter.com/dog_rates/status/717537687...,11,10,a,,,,
1078,739485634323156992,,,2016-06-05 15:54:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Kyle. He's a heavy drinker and an avid...,,,,https://twitter.com/dog_rates/status/739485634...,6,10,Kyle,,,,
1776,677961670166224897,,,2015-12-18 21:20:32 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",This is Izzy. She's showing off the dance move...,,,,https://vine.co/v/iKuMDuYV0aZ,11,10,Izzy,,,,
877,760893934457552897,,,2016-08-03 17:43:45 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Wishes. He has the day off. Daily stru...,,,,https://twitter.com/dog_rates/status/760893934...,11,10,Wishes,doggo,,,


In [14]:
tweets.rating_numerator.head()

0    13
1    13
2    12
3    13
4    12
Name: rating_numerator, dtype: int64

In [15]:
summary(tweets)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64

In [16]:
images.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2012,879050749262655488,https://pbs.twimg.com/media/DDMD_phXoAQ1qf0.jpg,1,tabby,0.311861,False,window_screen,0.169123,False,Egyptian_cat,0.132932,False
209,669972011175813120,https://pbs.twimg.com/media/CUw3_QiUEAA8cT9.jpg,1,teddy,0.953071,False,koala,0.007027,False,fur_coat,0.005368,False
166,668981893510119424,https://pbs.twimg.com/media/CUize-0WEAAerAK.jpg,1,jellyfish,0.447246,False,coral_reef,0.238625,False,goldfish,0.040227,False
399,673686845050527744,https://pbs.twimg.com/media/CVlqi_AXIAASlcD.jpg,1,Pekinese,0.185903,True,guinea_pig,0.172951,False,pug,0.166183,True
1433,773547596996571136,https://pbs.twimg.com/media/Crwxb5yWgAAX5P_.jpg,1,Norwegian_elkhound,0.372202,True,Chesapeake_Bay_retriever,0.137187,True,malamute,0.071436,True


In [17]:
summary(images)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB

 
 Duplicates = 0



tweet_id    0
jpg_url     0
img_num     0
p1          0
p1_conf     0
p1_dog      0
p2          0
p2_conf     0
p2_dog      0
p3          0
p3_conf     0
p3_dog      0
dtype: int64

In [18]:
# Counting how many tweets have each image number
images["img_num"].value_counts()

In [19]:
likes.sample(5)

Unnamed: 0,tweet_id,likes,retweets
893,756303284449767430,3707,1006
400,822462944365645825,26330,13163
884,757597904299253760,0,278
209,850019790995546112,18805,4382
1686,680440374763077632,1330,443


In [20]:
summary(likes)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2327 entries, 0 to 2326
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   tweet_id  2327 non-null   int64
 1   likes     2327 non-null   int64
 2   retweets  2327 non-null   int64
dtypes: int64(3)
memory usage: 54.7 KB

 
 Duplicates = 0



tweet_id    0
likes       0
retweets    0
dtype: int64

##### Quality
`tweets`
- Remove retweets
- Drop `source` column and `rating_denominator` column
- Rename `rating_numerator` column
- Incorrect `ratings`
- Missing values for `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`
- Incorrect data types for `tweet_id`, `in_reply_to_status_id`, `in_reply_to_user_id`, `timestamp`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`

`images`
- Incorrect datatype for `tweet_id`
- Renaming every column

`likes`
- Incorrect datatype for `tweet_id`

##### Tidy
`tweets`
- Dog stage is a single variable split into multiple columns (`doggo`, `floofer`, `pupper`, `puppo`)
- Combine `likes` and `retweets`

<a id='cleaning'></a>
#### Data Cleaning

In [21]:
tweets_clean = tweets.copy()
images_clean = images.copy()
likes_clean = likes.copy()

##### _Define_
_**Fix missing values**: Since there is no way to recover the missing values, and they are also meant to be unique, I will drop these columns_
##### _Code_

In [22]:
tweets_clean.drop(["in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", 
                "retweeted_status_user_id", "retweeted_status_timestamp", "expanded_urls"], 
              axis = 1, inplace = True)

##### _Test_

In [23]:
clean_test(tweets_clean)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2356 non-null   int64 
 1   timestamp           2356 non-null   object
 2   source              2356 non-null   object
 3   text                2356 non-null   object
 4   rating_numerator    2356 non-null   int64 
 5   rating_denominator  2356 non-null   int64 
 6   name                2356 non-null   object
 7   doggo               2356 non-null   object
 8   floofer             2356 non-null   object
 9   pupper              2356 non-null   object
 10  puppo               2356 non-null   object
dtypes: int64(3), object(8)
memory usage: 202.6+ KB


Unnamed: 0,tweet_id,timestamp,source,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2250,667832474953625600,2015-11-20 22:30:44 +0000,"<a href=""http://twitter.com/download/iphone"" r...",THE EYES 12/10\n\nI'm sorry. These are suppose...,12,10,,,,,
1162,723688335806480385,2016-04-23 01:41:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Reginald. He starts screaming at rando...,12,10,Reginald,,,,
1801,676957860086095872,2015-12-16 02:51:45 +0000,"<a href=""http://twitter.com/download/iphone"" r...",10/10 I'd follow this dog into battle no quest...,10,10,,,,,
1480,693486665285931008,2016-01-30 17:31:20 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Lincoln. He doesn't understand his new...,11,10,Lincoln,,,,
570,801854953262350336,2016-11-24 18:28:13 +0000,"<a href=""http://twitter.com/download/iphone"" r...",.@NBCSports OMG THE TINY HAT I'M GOING TO HAVE...,11,10,,,,,


##### _Define_
_**Merge dog stages**: The various dog stages are meant to be in one column_
##### _Code_

In [69]:
# Replacing Nones with empty strings for easier manipulation
for stage in ["doggo", "floofer", "pupper", "puppo"]:
    tweets_clean[stage] = tweets_clean[stage].replace("None", "")

In [70]:
tweets_clean["dog_stages"] = tweets_clean.doggo + tweets_clean.floofer + tweets_clean.pupper + tweets_clean.puppo

In [71]:
tweets_clean["dog_stages"].value_counts()

                1829
pupper           224
doggo             75
puppo             24
doggopupper       10
floofer            9
doggopuppo         1
doggofloofer       1
Name: dog_stages, dtype: int64

In [72]:
tweets_clean["dog_stages"] = tweets_clean["dog_stages"].replace(",", ", ")

In [76]:
tweets_clean.loc[tweets_clean.dog_stages == '', 'dog_stages'] = np.nan

In [78]:
tweets_clean.drop(["doggo", "floofer", "pupper", "puppo"], axis = 1, inplace = True)
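
Plain string concatenation produces values such as `doggopupper` for dogs with two stages. A minimal sketch of one alternative is to join the non-empty stages with a separator. The helper `join_stages` and the toy frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Toy frame using the same "None" placeholder strings as the archive
toy = pd.DataFrame({
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})


def join_stages(row):
    """Join the stages present in a row; NaN when no stage is given."""
    stages = [value for value in row if value != "None"]
    return ", ".join(stages) if stages else np.nan


toy["dog_stages"] = toy[stage_cols].apply(join_stages, axis=1)
print(toy["dog_stages"])
```
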

##### _Test_

In [79]:
clean_test(tweets_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2173 entries, 0 to 2355
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 0   tweet_id    2173 non-null   object             
 1   timestamp   2173 non-null   datetime64[ns, UTC]
 2   text        2173 non-null   object             
 3   ratings     2173 non-null   float64            
 4   name        2173 non-null   object             
 5   dog_stages  344 non-null    object             
dtypes: datetime64[ns, UTC](1), float64(1), object(4)
memory usage: 118.8+ KB


Unnamed: 0,tweet_id,timestamp,text,ratings,name,dog_stages
2270,667502640335572993,2015-11-20 00:40:05+00:00,Say hello to Hall and Oates. Oates is winking ...,11.0,Hall,
1849,675781562965868544,2015-12-12 20:57:34+00:00,Say hello to Buddah. He was Waldo for Hallowee...,11.0,Buddah,
2305,666837028449972224,2015-11-18 04:35:11+00:00,My goodness. Very rare dog here. Large. Tail d...,3.0,,
1742,679462823135686656,2015-12-23 00:45:35+00:00,Meet Hurley. He's the curly one. He hugs every...,11.0,Hurley,
2170,669351434509529089,2015-11-25 03:06:32+00:00,This is Sully. He's a Leviticus Galapagos. Ver...,10.0,Sully,


##### _Define_
_**Remove all retweets**: Removing every row where the `text` starts with "[RT](https://stackoverflow.com/a/46693056/18102026)" and removing the corresponding `likes` and `retweets`_
##### _Code_

In [24]:
id_val = tweets_clean[tweets_clean.text.str.startswith("RT")].tweet_id.values

In [25]:
tweets_clean = tweets_clean[~tweets_clean.text.str.startswith("RT")]

In [26]:
count = 0
for ids in id_val:
    try:
        loc = likes_clean.loc[likes_clean.tweet_id == ids].index[0]
        likes_clean.drop(index = loc, inplace = True)
    except IndexError:
        count += 1
count

21
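
The row-by-row loop can also be expressed with a boolean mask. The sketch below assumes the same idea (drop every row whose `tweet_id` is in a list of retweet ids) but runs on a made-up frame; the ids and counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for the likes table
toy_likes = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "likes": [10, 20, 30, 40],
    "retweets": [1, 2, 3, 4],
})

# Ids flagged as retweets; 99 is not in the table
retweet_ids = [2, 4, 99]

# Keep only the rows whose id is not flagged as a retweet
is_retweet = toy_likes["tweet_id"].isin(retweet_ids)
toy_likes_clean = toy_likes[~is_retweet]

# Flagged ids with no matching row, comparable to the loop's counter
missing = len(set(retweet_ids) - set(toy_likes["tweet_id"]))

print(toy_likes_clean)
print(missing)
```
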

##### _Test_

In [27]:
tweets_clean[tweets_clean.text.str.startswith("RT")]

Unnamed: 0,tweet_id,timestamp,source,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [28]:
clean_test(likes_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2165 entries, 0 to 2326
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   tweet_id  2165 non-null   int64
 1   likes     2165 non-null   int64
 2   retweets  2165 non-null   int64
dtypes: int64(3)
memory usage: 67.7 KB


Unnamed: 0,tweet_id,likes,retweets
1227,710609963652087808,4446,2164
1988,672160042234327040,765,304
2264,667152164079423490,42313,14623
84,876120275196170240,24697,3905
1627,683391852557561860,7059,2181


##### _Define_
_**Dropping the `source` and `rating_denominator` columns**: I don't think the `source` column will be important in my analysis, and the rating denominator is the same for all dogs_
##### _Code_

In [29]:
tweets_clean.drop(["source", "rating_denominator"], axis = 1, inplace = True)
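
Whether a column really holds a single value can be checked before it is dropped. The sketch below shows one way to do that on a made-up frame; on the real archive the result may differ, so this is an illustration of the check rather than a statement about the data.

```python
import pandas as pd

# Toy frame with a constant denominator
toy = pd.DataFrame({
    "rating_numerator": [13, 12, 11],
    "rating_denominator": [10, 10, 10],
})

# Frequency of each distinct denominator value
denominator_counts = toy["rating_denominator"].value_counts()

# True only when every row shares the same denominator
is_constant = toy["rating_denominator"].nunique() == 1

print(denominator_counts)
print(is_constant)
```
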

##### _Test_

In [30]:
clean_test(tweets_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2173 entries, 0 to 2355
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   tweet_id          2173 non-null   int64 
 1   timestamp         2173 non-null   object
 2   text              2173 non-null   object
 3   rating_numerator  2173 non-null   int64 
 4   name              2173 non-null   object
 5   doggo             2173 non-null   object
 6   floofer           2173 non-null   object
 7   pupper            2173 non-null   object
 8   puppo             2173 non-null   object
dtypes: int64(2), object(7)
memory usage: 169.8+ KB


Unnamed: 0,tweet_id,timestamp,text,rating_numerator,name,doggo,floofer,pupper,puppo
168,859607811541651456,2017-05-03 03:17:27 +0000,Sorry for the lack of posts today. I came home...,13,,,,,puppo
426,821407182352777218,2017-01-17 17:21:47 +0000,This is Sundance. He's a doggo drummer. Even s...,14,Sundance,doggo,,,
1318,706538006853918722,2016-03-06 17:52:42 +0000,"This is Oliver (pronounced ""Ricardo""). He's a ...",11,Oliver,,,,
1649,683742671509258241,2016-01-03 20:12:10 +0000,Meet Sebastian. He's a womanizer. Romantic af....,11,Sebastian,,,,
846,766313316352462849,2016-08-18 16:38:26 +0000,This is Oscar. He has legendary eyebrows and h...,12,Oscar,,,,


##### _Define_
_**Renaming the `rating_numerator` column**: I'll change it to just `ratings` since the rating_denominator has been removed_
##### _Code_

In [31]:
tweets_clean.rename({"rating_numerator": "ratings"}, axis = 1, inplace = True)

##### _Test_

In [61]:
clean_test(tweets_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2173 entries, 0 to 2355
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype              
---  ------     --------------  -----              
 0   tweet_id   2173 non-null   object             
 1   timestamp  2173 non-null   datetime64[ns, UTC]
 2   text       2173 non-null   object             
 3   ratings    2173 non-null   float64            
 4   name       2173 non-null   object             
 5   doggo      2173 non-null   object             
 6   floofer    2173 non-null   object             
 7   pupper     2173 non-null   object             
 8   puppo      2173 non-null   object             
dtypes: datetime64[ns, UTC](1), float64(1), object(7)
memory usage: 169.8+ KB


Unnamed: 0,tweet_id,timestamp,text,ratings,name,doggo,floofer,pupper,puppo
45,883482846933004288,2017-07-08 00:28:19+00:00,This is Bella. She hopes her smile made you sm...,13.5,Bella,,,,
1941,673715861853720576,2015-12-07 04:09:13+00:00,This is a heavily opinionated dog. Loves walls...,4.0,a,,,,
1277,709042156699303936,2016-03-13 15:43:18+00:00,This is Klevin. He's addicted to sandwiches (y...,9.0,Klevin,,,,
510,812372279581671427,2016-12-23 19:00:19+00:00,This is Moe. He's a fetty woof. Got a cardboar...,13.0,Moe,,,,
314,835172783151792128,2017-02-24 17:01:22+00:00,We only rate dogs. Please don't send in any no...,12.0,,,,,


##### _Define_
_**Renaming every column in the `images` data frame**: Giving every column a more descriptive name_
##### _Code_

In [63]:
images_clean.rename({
    "img_num": "image_number",
    "p1": "prediction1", 
    "p1_conf": "prediction1_confidence", 
    "p1_dog": "prediction1_dog", 
    "p2": "prediction2", 
    "p2_conf": "prediction2_confidence", 
    "p2_dog": "prediction2_dog", 
    "p3": "prediction3", 
    "p3_conf": "prediction3_confidence", 
    "p3_dog": "prediction3_dog"
}, axis = 1, inplace = True)

##### _Test_

In [64]:
clean_test(images_clean)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   tweet_id                2075 non-null   object 
 1   jpg_url                 2075 non-null   object 
 2   image_number            2075 non-null   int64  
 3   prediction1             2075 non-null   object 
 4   prediction1_confidence  2075 non-null   float64
 5   prediction1_dog         2075 non-null   bool   
 6   prediction2             2075 non-null   object 
 7   prediction2_confidence  2075 non-null   float64
 8   prediction2_dog         2075 non-null   bool   
 9   prediction3             2075 non-null   object 
 10  prediction3_confidence  2075 non-null   float64
 11  prediction3_dog         2075 non-null   bool   
dtypes: bool(3), float64(3), int64(1), object(5)
memory usage: 152.1+ KB


Unnamed: 0,tweet_id,jpg_url,image_number,prediction1,prediction1_confidence,prediction1_dog,prediction2,prediction2_confidence,prediction2_dog,prediction3,prediction3_confidence,prediction3_dog
699,684594889858887680,https://pbs.twimg.com/media/CYAra7JWsAACPZH.jpg,1,Weimaraner,0.948688,True,English_setter,0.035352,True,Brittany_spaniel,0.003879,True
895,699446877801091073,https://pbs.twimg.com/media/CbTvNpoW0AEemnx.jpg,3,Pembroke,0.9694,True,Cardigan,0.026059,True,Chihuahua,0.003505,True
31,666421158376562688,https://pbs.twimg.com/media/CT-aggCXAAIMfT3.jpg,1,Blenheim_spaniel,0.906777,True,cocker_spaniel,0.090346,True,Shih-Tzu,0.001117,True
741,687317306314240000,https://pbs.twimg.com/media/CYnXcLEUkAAIQOM.jpg,1,Shih-Tzu,0.747208,True,Maltese_dog,0.091025,True,Lhasa,0.035788,True
195,669583744538451968,https://pbs.twimg.com/media/CUrW3DWXIAAiRqk.jpg,1,candle,0.174315,False,lampshade,0.120407,False,plunger,0.072099,False


##### _Define_
_**Getting correct `ratings`**: Making sure all ratings are correct_
##### _Code_

In [32]:
fix = tweets_clean[tweets_clean.text.str.contains(r"\d+\.\d*/\d+")].text.values


In [53]:
for item in fix:
    val = tweets_clean["text"] == item
    tweets_clean.loc[val, "ratings"] = re.findall(r"\d+\.\d+", item)
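
One possible vectorized way to pull decimal ratings is `Series.str.extract`. The sketch below uses a made-up three-row frame (the texts and starting ratings are illustrative, not taken from the archive) and keeps the existing rating wherever no decimal rating is found. Converting to `float` straight away also keeps the column numeric.

```python
import pandas as pd

# Toy frame: two texts carry a decimal rating, one carries a whole number
toy = pd.DataFrame({
    "text": [
        "This is Bella. 13.5/10",
        "Good dog 12/10",
        "Logan 9.75/10 would pet",
    ],
    "ratings": [5.0, 12.0, 75.0],
})

# Capture the decimal numerator that sits directly before "/<denominator>"
decimal = toy["text"].str.extract(r"(\d+\.\d+)/\d+", expand=False)

# Use the extracted value where there is one, otherwise keep the old rating
toy["ratings"] = decimal.astype(float).fillna(toy["ratings"])

print(toy["ratings"])
```
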

##### _Test_

In [60]:
tweets_clean[tweets_clean.text.str.contains(r"\d+\.\d*/\d+")]


Unnamed: 0,tweet_id,timestamp,text,ratings,name,doggo,floofer,pupper,puppo
45,883482846933004288,2017-07-08 00:28:19+00:00,This is Bella. She hopes her smile made you sm...,13.5,Bella,,,,
695,786709082849828864,2016-10-13 23:23:56+00:00,"This is Logan, the Chow who lived. He solemnly...",9.75,Logan,,,,
763,778027034220126208,2016-09-20 00:24:34+00:00,This is Sophie. She's a Jubilant Bush Pupper. ...,11.27,Sophie,,,pupper,
1689,681340665377193984,2015-12-28 05:07:27+00:00,I've been told there's a slight possibility he...,9.5,,,,,
1712,680494726643068929,2015-12-25 21:06:00+00:00,Here we have uncovered an entire battalion of ...,11.26,,,,,


In [55]:
clean_test(tweets_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2173 entries, 0 to 2355
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype              
---  ------     --------------  -----              
 0   tweet_id   2173 non-null   object             
 1   timestamp  2173 non-null   datetime64[ns, UTC]
 2   text       2173 non-null   object             
 3   ratings    2173 non-null   object             
 4   name       2173 non-null   object             
 5   doggo      2173 non-null   object             
 6   floofer    2173 non-null   object             
 7   pupper     2173 non-null   object             
 8   puppo      2173 non-null   object             
dtypes: datetime64[ns, UTC](1), object(8)
memory usage: 169.8+ KB


Unnamed: 0,tweet_id,timestamp,text,ratings,name,doggo,floofer,pupper,puppo
1742,679462823135686656,2015-12-23 00:45:35+00:00,Meet Hurley. He's the curly one. He hugs every...,11,Hurley,,,,
1760,678708137298427904,2015-12-20 22:46:44+00:00,Here we are witnessing a wild field pupper. Lo...,10,,,,pupper,
1424,697990423684476929,2016-02-12 03:47:39+00:00,This is Doug. He's a Draconian Jabbawockee. Ra...,11,Doug,,,,
1072,739979191639244800,2016-06-07 00:36:02+00:00,This is Nollie. She's waving at you. If you do...,12,Nollie,,,,
134,866686824827068416,2017-05-22 16:06:55+00:00,This is Lili. She can't believe you betrayed h...,12,Lili,,,,


Removing the retweets reduced the number of rows from 2356 to 2173. Most of the columns with incorrect data types were dropped, which leaves only `tweet_id` and `timestamp` to convert. I also noticed that after pulling the `ratings` from the `text`, the data type of `ratings` changed to `object`.
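A likely reason for the `object` type is that the extracted ratings are strings, and a column that mixes numbers and strings cannot keep a numeric dtype. The sketch below uses a made-up series named `ratings` to show the effect and one way to undo it with `pd.to_numeric`.

```python
import pandas as pd

# Hypothetical ratings column after string numerators were written into it.
ratings = pd.Series([13, "13.5", 12, "9.75"])
print(ratings.dtype)  # object, because integers and strings are mixed

# pd.to_numeric parses the strings and gives the column a single numeric dtype.
numeric = pd.to_numeric(ratings)
print(numeric.dtype)
```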

##### _Define_
_**Changing data types**: Converting `tweet_id` to `object` in all three dataframes because it is an identifier and is not going to be used for calculations. Also converting the `timestamp` column to datetime and the `ratings` column to `float64`_
##### _Code_

In [37]:
tweets_clean.tweet_id = tweets_clean.tweet_id.astype(object)

In [38]:
tweets_clean.timestamp = pd.to_datetime(tweets_clean.timestamp)

In [58]:
tweets_clean.ratings = tweets_clean.ratings.astype("float64")

In [40]:
images_clean.tweet_id = images_clean.tweet_id.astype(object)

In [41]:
likes_clean.tweet_id = likes_clean.tweet_id.astype(object)
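As a small illustration of these conversions, here is a sketch on a toy dataframe built from two ids and timestamps shown in the output above. It swaps in `astype(str)` for the notebook's `astype(object)`, since `astype(object)` keeps the values as Python integers, and it passes `utc=True` as an assumption about the desired timezone handling.

```python
import pandas as pd

# Toy frame with raw dtypes: integer ids and timestamps stored as text.
toy = pd.DataFrame({
    "tweet_id": [883482846933004288, 786709082849828864],
    "timestamp": ["2017-07-08 00:28:19+00:00", "2016-10-13 23:23:56+00:00"],
})

# Identifiers are labels, not quantities, so store them as strings.
toy["tweet_id"] = toy["tweet_id"].astype(str)

# utc=True yields a timezone-aware datetime column.
toy["timestamp"] = pd.to_datetime(toy["timestamp"], utc=True)

print(toy.dtypes)
```

Whichever type is chosen for `tweet_id`, it should be the same in every dataframe that will later be merged on that column.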

##### _Test_

In [59]:
clean_test(tweets_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2173 entries, 0 to 2355
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype              
---  ------     --------------  -----              
 0   tweet_id   2173 non-null   object             
 1   timestamp  2173 non-null   datetime64[ns, UTC]
 2   text       2173 non-null   object             
 3   ratings    2173 non-null   float64            
 4   name       2173 non-null   object             
 5   doggo      2173 non-null   object             
 6   floofer    2173 non-null   object             
 7   pupper     2173 non-null   object             
 8   puppo      2173 non-null   object             
dtypes: datetime64[ns, UTC](1), float64(1), object(7)
memory usage: 169.8+ KB


Unnamed: 0,tweet_id,timestamp,text,ratings,name,doggo,floofer,pupper,puppo
1201,716447146686459905,2016-04-03 02:08:05+00:00,This is Ozzy. He's acrobatic af. Legendary pup...,13.0,Ozzy,,,pupper,
356,830583320585068544,2017-02-12 01:04:29+00:00,This is Lilly. She just parallel barked. Kindl...,13.0,Lilly,,,,
929,754482103782404096,2016-07-17 01:05:25+00:00,This is Milo. He's currently plotting his reve...,10.0,Milo,,,,
2325,666418789513326592,2015-11-17 00:53:15+00:00,This is Walter. He is an Alaskan Terrapin. Lov...,10.0,Walter,,,,
1964,673350198937153538,2015-12-06 03:56:12+00:00,This is Bella. She's a Genghis Flopped Canuck....,9.0,Bella,,,,


In [43]:
clean_test(images)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
296,671362598324076544,https://pbs.twimg.com/media/CVEouDRXAAEe8mt.jpg,1,tub,0.393616,False,bathtub,0.383522,False,swimming_trunks,0.077301,False
1868,844223788422217728,https://pbs.twimg.com/media/C7dJCnqU4AAswat.jpg,1,Labrador_retriever,0.71951,True,Chesapeake_Bay_retriever,0.122019,True,Newfoundland,0.038828,True
1182,738402415918125056,https://pbs.twimg.com/media/Cj9VEs_XAAAlTai.jpg,1,cocker_spaniel,0.346695,True,Blenheim_spaniel,0.193905,True,Chihuahua,0.078,True
215,670055038660800512,https://pbs.twimg.com/media/CUyDgChWUAAmNSI.jpg,1,snail,0.563631,False,slug,0.296649,False,bolete,0.031839,False
313,671561002136281088,https://pbs.twimg.com/media/CVHdK-7WwAAsuyc.jpg,1,Gordon_setter,0.469373,True,black-and-tan_coonhound,0.270893,True,Rottweiler,0.153233,True


In [44]:
clean_test(likes_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2165 entries, 0 to 2326
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   tweet_id  2165 non-null   object
 1   likes     2165 non-null   int64 
 2   retweets  2165 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 67.7+ KB


Unnamed: 0,tweet_id,likes,retweets
1577,685681090388975616,1889,110
1841,675149409102012420,3357,1503
1340,702539513671897089,2655,875
858,760252756032651264,3714,798
1504,690005060500217858,3392,1540


##### _Define_
_**Merging**: Merging the tweets dataframe with the likes and retweets dataframe_
##### _Code_

In [45]:
# True if at least one tweet_id in tweets has no match in likes
(~tweets.tweet_id.isin(likes.tweet_id)).any()


True

This shows that some tweets in the `tweets` dataframe are not in the `likes` dataframe, which agrees with the different row counts reported by the `.info()` method for the two dataframes.
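To see which ids are missing, and not only whether any are, an element-wise membership test helps. This is a minimal sketch with hypothetical toy frames `tweets_toy` and `likes_toy`; note that `in` applied to two NumPy arrays does not perform this kind of per-id check.

```python
import pandas as pd

# Toy frames; the names and ids are hypothetical.
tweets_toy = pd.DataFrame({"tweet_id": [1, 2, 3, 4]})
likes_toy = pd.DataFrame({"tweet_id": [1, 2, 4], "likes": [10, 20, 40]})

# isin tests each id separately, giving one boolean per row of tweets_toy.
missing_mask = ~tweets_toy["tweet_id"].isin(likes_toy["tweet_id"])

print(missing_mask.any())                                   # is anything missing?
print(tweets_toy.loc[missing_mask, "tweet_id"].tolist())    # which ids are missing?
```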

In [46]:
data = tweets_clean.merge(likes_clean, on = "tweet_id")
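Because `merge` defaults to an inner join, tweets without a match in the other dataframe are dropped silently. The sketch below, on made-up toy frames, shows that behaviour and how an outer merge with `indicator=True` can be used to audit which rows would be lost.

```python
import pandas as pd

# Toy frames; ids are strings so both key columns share one dtype.
tweets_toy = pd.DataFrame({"tweet_id": ["1", "2", "3"], "ratings": [12, 10, 13]})
likes_toy = pd.DataFrame({"tweet_id": ["1", "3", "9"], "likes": [500, 900, 70]})

# Default how="inner": only ids present in both frames survive.
inner = tweets_toy.merge(likes_toy, on="tweet_id")

# An outer merge with indicator=True labels where each row came from.
audit = tweets_toy.merge(likes_toy, on="tweet_id", how="outer", indicator=True)

print(len(inner))
print(audit["_merge"].value_counts())
```

Rows marked `left_only` in `_merge` are the ones an inner merge would discard from the left dataframe.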

##### _Test_

In [47]:
clean_test(data)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2165 entries, 0 to 2164
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype              
---  ------     --------------  -----              
 0   tweet_id   2165 non-null   object             
 1   timestamp  2165 non-null   datetime64[ns, UTC]
 2   text       2165 non-null   object             
 3   ratings    2165 non-null   int64              
 4   name       2165 non-null   object             
 5   doggo      2165 non-null   object             
 6   floofer    2165 non-null   object             
 7   pupper     2165 non-null   object             
 8   puppo      2165 non-null   object             
 9   likes      2165 non-null   int64              
 10  retweets   2165 non-null   int64              
dtypes: datetime64[ns, UTC](1), int64(3), object(7)
memory usage: 203.0+ KB


Unnamed: 0,tweet_id,timestamp,text,ratings,name,doggo,floofer,pupper,puppo,likes,retweets
873,741793263812808706,2016-06-12 00:44:30+00:00,When your crush won't pay attention to you. Bo...,10,,,,,,4220,1388
2042,668226093875376128,2015-11-22 00:34:50+00:00,Sneaky dog here. Tuba player has no clue. 10/1...,10,,,,,,247,85
1631,676590572941893632,2015-12-15 02:32:17+00:00,After some outrage from the crowd. Bubbles is ...,7,,,,,,841,110
104,870063196459192321,2017-05-31 23:43:25+00:00,Meet Clifford. He's quite large. Also red. Goo...,14,Clifford,,,,,32432,7078
1494,681981167097122816,2015-12-29 23:32:35+00:00,This is Jimothy. He's a Trinidad Poliwhirl. Fa...,12,Jimothy,,,,,2601,922


##### _Define_
_**Dropping the `text` column**: I figured there would be no need for the `text` column because the name of the dog and the ratings already have their own columns_
##### _Code_

In [48]:
# data.drop("text", axis = 1, inplace = True)

##### _Test_

In [49]:
# clean_test(data)

In [50]:
data.sample(10)

Unnamed: 0,tweet_id,timestamp,text,ratings,name,doggo,floofer,pupper,puppo,likes,retweets
2135,666411507551481857,2015-11-17 00:24:19+00:00,This is quite the dog. Gets really excited whe...,2,quite,,,,,371,262
918,735256018284875776,2016-05-24 23:47:49+00:00,This is Kellogg. He accidentally opened the fr...,8,Kellogg,doggo,,,,3076,804
1682,675147105808306176,2015-12-11 02:56:28+00:00,When you're presenting a group project and the...,4,,,,,,842,200
1887,670838202509447168,2015-11-29 05:34:25+00:00,Pink dogs here. Unreasonably long necks. Left ...,1,,,,,,984,599
1893,670815497391357952,2015-11-29 04:04:12+00:00,This is Sage. He likes to burn shit. 10/10 htt...,10,Sage,,,,,2792,1323
879,740995100998766593,2016-06-09 19:52:53+00:00,This is Leo. He's a vape god. Blows o's for da...,10,Leo,,,,,5839,2478
851,744995568523612160,2016-06-20 20:49:19+00:00,This is Abby. She got her face stuck in a glas...,9,Abby,,,,puppo,2752,569
5,891087950875897856,2017-07-29 00:08:17+00:00,Here we have a majestic great white breaching ...,13,,,,,,17815,2602
1006,717841801130979328,2016-04-06 22:29:56+00:00,This is Barclay. His father was a banana. 11/1...,11,Barclay,,,,,2281,545
817,748220828303695873,2016-06-29 18:25:21+00:00,Stop what you're doing and watch this heckin m...,13,,,,,,13163,7128


In [51]:
images_clean.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
760,688828561667567616,https://pbs.twimg.com/media/CY816snW8AYltrQ.jpg,1,Cardigan,0.614231,True,skunk,0.139392,False,toilet_tissue,0.031158,False
1220,743980027717509120,https://pbs.twimg.com/media/ClMl4VLUYAA5qBb.jpg,1,bull_mastiff,0.97573,True,Rhodesian_ridgeback,0.008073,True,pug,0.005571,True
587,679062614270468097,https://pbs.twimg.com/media/CWyD2HGUYAQ1Xa7.jpg,2,cash_machine,0.802333,False,schipperke,0.045519,True,German_shepherd,0.023353,True
96,667724302356258817,https://pbs.twimg.com/media/CUQ7tv3W4AA3KlI.jpg,1,ibex,0.619098,False,bighorn,0.125119,False,ram,0.074673,False
236,670428280563085312,https://pbs.twimg.com/media/CU3W9ELWEAEdUA0.jpg,1,chow,0.335269,True,golden_retriever,0.30585,True,Tibetan_mastiff,0.063325,True
77,667437278097252352,https://pbs.twimg.com/media/CUM2qWaWoAUZ06L.jpg,1,porcupine,0.989154,False,bath_towel,0.0063,False,badger,0.000966,False
833,693942351086120961,https://pbs.twimg.com/media/CaFg41YWkAAdOjy.jpg,1,groenendael,0.550796,True,Norwegian_elkhound,0.15477,True,schipperke,0.080802,True
271,670823764196741120,https://pbs.twimg.com/media/CU8-puBWwAAR8Xl.jpg,1,Labrador_retriever,0.947453,True,German_short-haired_pointer,0.017001,True,Weimaraner,0.015432,True
288,671159727754231808,https://pbs.twimg.com/media/CVBwNjVWwAAlUFQ.jpg,1,pitcher,0.117446,False,sunglasses,0.062487,False,mask,0.059517,False
1633,805932879469572096,https://pbs.twimg.com/media/Cy8_qt0UUAAHuuN.jpg,1,Norwegian_elkhound,0.657967,True,keeshond,0.319136,True,Leonberg,0.007947,True


<a id='analysis'></a>
## Data Analysis and Visualization

<a id='q1'></a>
#### Do the ratings influence the likes and retweets?

In [52]:
data.ratings.value_counts()

12      485
10      426
11      410
13      296
9       149
8       100
7        50
14       40
6        35
5        33
3        28
2        24
4        24
1        13
0         7
31        2
100       2
50        2
17        2
80        2
420       2
2015      2
97        1
121       1
143       1
61        1
88        1
20        1
92        1
144       1
44        1
22        1
2016      1
2003      1
15        1
33        1
165       1
60        1
400       1
45        1
204       1
47        1
1776      1
2002      1
46        1
98        1
24        1
84        1
960       1
182       1
236       1
150       1
1949      1
Name: ratings, dtype: int64
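One way to start on this question is to set aside the extreme numerators seen in the counts and look at how ratings move together with likes and retweets. The sketch below uses made-up numbers and a hypothetical cutoff of 20; it only illustrates the mechanics and is not a result from this dataset.

```python
import pandas as pd

# Made-up numbers, only to show the mechanics.
toy = pd.DataFrame({
    "ratings":  [5, 8, 10, 11, 12, 13, 1776],
    "likes":    [500, 900, 2000, 3000, 6000, 12000, 5000],
    "retweets": [100, 200, 600, 900, 2000, 4000, 2500],
})

# Hypothetical cutoff: treat numerators above 20 as outliers for this check.
plausible = toy[toy["ratings"] <= 20]

# Pearson correlation of ratings against each engagement column.
corr = plausible.corr()["ratings"].drop("ratings")
print(corr)
```

A positive correlation would suggest that higher ratings tend to come with more engagement, but it would not by itself show that the rating causes it.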