# Project: Wrangling and Analyze Data

In [1]:
from configparser import ConfigParser
from io import BytesIO
from itertools import islice
from json import dump, loads
from os import environ
from pathlib import Path

import requests
import tweepy

import pandas as pd
from IPython.display import clear_output

<span style="color: red;">We'll store all downloaded and generated files in this directory. We create it if it doesn't exist.</span>

In [2]:
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

<span style="color: red;">We assume this file was downloaded manually. We first check whether it exists; if it doesn't, we download it and then open it as if it had already been on the filesystem.</span>

In [3]:
twitter_archived_enhanced_csv = data_dir / "twitter-archive-enhanced.csv"

In [4]:
if not twitter_archived_enhanced_csv.exists():
    with requests.get(
        "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv",
        stream=True
    ) as response:
        response.raise_for_status()
        with open(twitter_archived_enhanced_csv, "wb") as f:
            for chunk in response.iter_content(chunk_size=32768):  
                f.write(chunk)

<span style="color: red;">Before we open the file, let's check the first few lines to determine the proper way to open it with pandas:</span>

In [5]:
with open(twitter_archived_enhanced_csv, "r") as f:
    for line in islice(f, 5):
        print(line)

tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo

892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,None,None,None,None

892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,None,None,None,None

891815181378084864,,,20

<span style="color: red;">There's nothing out of the ordinary, so we can just use ``pandas.read_csv`` as usual:</span>

In [6]:
twitter_archived_enhanced_df = pd.read_csv(twitter_archived_enhanced_csv)

In [7]:
twitter_archived_enhanced_df.head(1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,


2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

<span style="color: red;">Since we're hitting an external service, we first check whether we've already saved the loaded dataframe. This isn't part of the wrangling process per se, but a convenience to avoid hitting external services multiple times while running the notebook.</span>

In [8]:
image_predictions_df = None

In [9]:
image_predictions_df_pkl = data_dir / "image_predictions_df.pkl"

In [10]:
if image_predictions_df_pkl.exists():
    image_predictions_df = pd.read_pickle(image_predictions_df_pkl)

<span style="color: red;">If we haven't created the local pickled version, proceed with the expected download process:</span>

In [11]:
image_predictions_tsv_url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"

In [12]:
if image_predictions_df is None:
    # First we read the first few lines of the file to determine the right way of ingesting it with pandas
    with requests.get(image_predictions_tsv_url) as response:
        response.raise_for_status()
        
        for line in islice(
            filter(lambda l: bool(l), response.iter_lines(decode_unicode=True)),  # Filter out empty lines
            5                                                                     # Take only the first 5 lines
        ):
            print(line)

<span style="color: red;">As we might have guessed from the file extension, this is a tab-separated file with a header, which pandas can handle without any issues if the right ``sep`` value is used:</span>

In [13]:
if image_predictions_df is None:
    with requests.get(image_predictions_tsv_url) as response:
        response.raise_for_status()
        image_predictions_df = pd.read_csv(BytesIO(response.content), sep="\t")
        image_predictions_df.to_pickle(image_predictions_df_pkl)

<span style="color: red;">Alternatively, we could have saved the file and then opened it with pandas:</span>

```python
with requests.get(image_predictions_tsv_url) as response:
    response.raise_for_status()
    
    with open(data_dir / "image_predictions.tsv", "wb") as f:
        f.write(response.content)
        
    image_predictions_df = pd.read_csv(data_dir / "image_predictions.tsv", sep="\t")
```

<span style="color: red;">Or even better, let pandas handle everything for us:</span>

```python     
image_predictions_df = pd.read_csv(image_predictions_tsv_url, sep="\t")
```
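
<span style="color: red;">The cache check and download cells above follow a load-or-build pattern. A minimal sketch of how it could be folded into one helper is shown below. ``load_or_build`` and ``fake_download`` are hypothetical names introduced for illustration, and the builder returns made-up rows instead of performing a real download:</span>

```python
from pathlib import Path
from tempfile import TemporaryDirectory

import pandas as pd


def load_or_build(cache_path, build):
    """Return the pickled dataframe at cache_path, building and caching it if absent.

    `build` is a zero-argument callable returning a dataframe (hypothetical helper).
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    df = build()
    df.to_pickle(cache_path)
    return df


calls = []  # records how many times the builder actually ran


def fake_download():
    # Stand-in for the real download; the rows are made up.
    calls.append(1)
    return pd.DataFrame({"tweet_id": [1, 2], "p1": ["collie", "redbone"]})


with TemporaryDirectory() as tmp:
    cache = Path(tmp) / "image_predictions_df.pkl"
    first = load_or_build(cache, fake_download)   # builds and writes the pickle
    second = load_or_build(cache, fake_download)  # served from the pickle
```

<span style="color: red;">The builder runs only on the first call, so re-running the notebook would not hit the external service again.</span>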

In [14]:
image_predictions_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

<span style="color: red;">First we need to instantiate ``tweepy``'s ``API`` object. Notice that Twitter's credentials need to be set in ``config.ini``:</span>

In [15]:
config = ConfigParser()
config.read("config.ini");

In [16]:
assert "twitter" in config.sections()
assert {'consumer_key', 'consumer_secret', 'access_token', 'access_token_secret'} == set(config["twitter"].keys())
assert bool(config["twitter"]["consumer_key"])
assert bool(config["twitter"]["consumer_secret"])
assert bool(config["twitter"]["access_token"])
assert bool(config["twitter"]["access_token_secret"])
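
<span style="color: red;">For reference, a sketch of the ``config.ini`` layout that the assertions above expect. The values are placeholders, not real credentials, and the file content is parsed from a string here only to keep the example self-contained:</span>

```python
from configparser import ConfigParser

# Placeholder values; replace them with your own Twitter credentials in config.ini.
EXAMPLE_CONFIG = """
[twitter]
consumer_key = YOUR_CONSUMER_KEY
consumer_secret = YOUR_CONSUMER_SECRET
access_token = YOUR_ACCESS_TOKEN
access_token_secret = YOUR_ACCESS_TOKEN_SECRET
"""

example = ConfigParser()
example.read_string(EXAMPLE_CONFIG)

expected_keys = {"consumer_key", "consumer_secret", "access_token", "access_token_secret"}
keys_ok = set(example["twitter"].keys()) == expected_keys
values_ok = all(bool(example["twitter"][key]) for key in expected_keys)
```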

In [17]:
auth = tweepy.OAuthHandler(
    config["twitter"]["consumer_key"],
    config["twitter"]["consumer_secret"],
    access_token=config["twitter"]["access_token"],
    access_token_secret=config["twitter"]["access_token_secret"]
)
api = tweepy.API(
    auth,
    wait_on_rate_limit=True
)

<span style="color: red;">Given [Twitter's API rate limits](https://developer.twitter.com/en/docs/twitter-api/rate-limits), we need to compute the rate at which we can safely fetch the tweet info based on the total number of tweets. As of this writing, we can only fetch 900 tweets every 15 minutes, so we need to wait at least 1 second ((15 * 60) / 900) between requests. Given that the total number of unique tweets is:</span>

In [18]:
total_no_tweets = twitter_archived_enhanced_df.tweet_id.nunique()
print(total_no_tweets)

2356


<span style="color: red;">It'll take us about 40 minutes to fetch the tweet information.</span>
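
<span style="color: red;">The estimate follows from simple arithmetic. A quick sketch, using the 900-requests-per-15-minutes figure quoted above and the 2356 unique ids printed earlier (the limit may have changed since this was written):</span>

```python
WINDOW_SECONDS = 15 * 60     # length of one rate-limit window
REQUESTS_PER_WINDOW = 900    # limit quoted above
total_no_tweets = 2356       # unique tweet ids printed above

seconds_per_request = WINDOW_SECONDS / REQUESTS_PER_WINDOW
estimated_minutes = total_no_tweets * seconds_per_request / 60

print(seconds_per_request)          # 1.0
print(round(estimated_minutes, 1))  # 39.3
```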

In [19]:
tweet_json_txt = data_dir / "tweet_json.txt"

In [20]:
errors = []  # We'll keep track of errors

if not tweet_json_txt.exists():
    with open(tweet_json_txt, "w") as f:
        # We're iterating over unique tweet ids in case there are duplicates in twitter_archived_enhanced_df
        for i, tweet_id in enumerate(twitter_archived_enhanced_df.tweet_id.unique(), start=1):
            tweet = None
            
            try:
                tweet = api.get_status(
                    tweet_id,
                    tweet_mode='extended'  # Requesting extended tweets as per https://docs.tweepy.org/en/stable/extended_tweets.html#extended-mode
                )

                #
                # Print progress
                #
                clear_output(wait=True)
                print(f"{i}/{total_no_tweets}: {tweet_id}")
                
                #
                # Notice that we're using a JSONL (JSON lines) format. This makes it easier to monitor
                # the progress as we can tail tweet_json.txt while we're fetching the info.
                #
                dump(tweet._json, f)
                f.write("\n")
            except Exception as e:
                print(f"Exception raised while fetching tweet status. tweet_id: {tweet_id}, e: {e}")
                errors.append((tweet_id, e, tweet))
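
<span style="color: red;">The JSONL layout used in the loop above stores one JSON object per line. A minimal round-trip sketch with made-up records, where an in-memory buffer stands in for the ``tweet_json.txt`` file handle:</span>

```python
import json
from io import StringIO

# Made-up records standing in for tweet._json payloads.
records = [
    {"id": 1, "retweet_count": 10, "favorite_count": 25},
    {"id": 2, "retweet_count": 3, "favorite_count": 7},
]

buffer = StringIO()  # stands in for the open file
for record in records:
    json.dump(record, buffer)  # one JSON document ...
    buffer.write("\n")         # ... per line

buffer.seek(0)
# Each line parses on its own, so a partially written file is still readable line by line.
loaded = [json.loads(line) for line in buffer if line.strip()]
```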

<span style="color: red;">Let's check how many errors were encountered during the fetch process:</span>

In [21]:
len(errors)  # It might be 0 if we skipped fetching the tweet information

0

<span style="color: red;">Let's show the first few lines of ``tweet_json.txt`` to verify it was written correctly. We've used the JSONL (JSON lines) format because it is easier to work with than streaming JSON data.</span>

In [22]:
with open(tweet_json_txt, "r") as f:
    for line in islice(f, 5):  # Take only the first 5 lines
        print(line)

{"created_at": "Tue Aug 01 16:23:56 +0000 2017", "id": 892420643555336193, "id_str": "892420643555336193", "full_text": "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU", "truncated": false, "display_text_range": [0, 85], "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 892420639486877696, "id_str": "892420639486877696", "indices": [86, 109], "media_url": "http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg", "media_url_https": "https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg", "url": "https://t.co/MgUWQ76dJU", "display_url": "pic.twitter.com/MgUWQ76dJU", "expanded_url": "https://twitter.com/dog_rates/status/892420643555336193/photo/1", "type": "photo", "sizes": {"thumb": {"w": 150, "h": 150, "resize": "crop"}, "medium": {"w": 540, "h": 528, "resize": "fit"}, "small": {"w": 540, "h": 528, "resize": "fit"}, "large": {"w": 540, "h": 528, "resize": "fit"}}}]}, "extended_entities": {"media

<span style="color: red;">The first few lines are ok. Let's load all of the lines into a list.</span>

In [23]:
tweet_json_txt_json = []
with open(tweet_json_txt, "r") as f:
    tweet_json_txt_json = [
        loads(line)
        for line in f
    ]
tweet_json_txt_json[:5]

[{'created_at': 'Tue Aug 01 16:23:56 +0000 2017',
  'id': 892420643555336193,
  'id_str': '892420643555336193',
  'full_text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
  'truncated': False,
  'display_text_range': [0, 85],
  'entities': {'hashtags': [],
   'symbols': [],
   'user_mentions': [],
   'urls': [],
   'media': [{'id': 892420639486877696,
     'id_str': '892420639486877696',
     'indices': [86, 109],
     'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
     'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
     'url': 'https://t.co/MgUWQ76dJU',
     'display_url': 'pic.twitter.com/MgUWQ76dJU',
     'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
     'type': 'photo',
     'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
      'medium': {'w': 540, 'h': 528, 'resize': 'fit'},
      'small': {'w': 540, 'h': 528, 'resize': 'fit'}

<span style="color: red;">Now we can create a dataframe with ``DataFrame.from_records``:</span>

In [24]:
tweet_json_txt_df = pd.DataFrame.from_records(tweet_json_txt_json)

In [25]:
tweet_json_txt_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2327 entries, 0 to 2326
Data columns (total 32 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   created_at                     2327 non-null   object 
 1   id                             2327 non-null   int64  
 2   id_str                         2327 non-null   object 
 3   full_text                      2327 non-null   object 
 4   truncated                      2327 non-null   bool   
 5   display_text_range             2327 non-null   object 
 6   entities                       2327 non-null   object 
 7   extended_entities              2057 non-null   object 
 8   source                         2327 non-null   object 
 9   in_reply_to_status_id          77 non-null     float64
 10  in_reply_to_status_id_str      77 non-null     object 
 11  in_reply_to_user_id            77 non-null     float64
 12  in_reply_to_user_id_str        77 non-null     o

In [26]:
tweet_json_txt_df.head()

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
0,Tue Aug 01 16:23:56 +0000 2017,892420643555336193,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,False,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,False,False,en,,,,,
1,Tue Aug 01 00:17:27 +0000 2017,892177421306343426,892177421306343426,This is Tilly. She's just checking pup on you....,False,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,False,False,en,,,,,
2,Mon Jul 31 00:18:03 +0000 2017,891815181378084864,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,False,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,False,False,en,,,,,
3,Sun Jul 30 15:58:51 +0000 2017,891689557279858688,891689557279858688,This is Darla. She commenced a snooze mid meal...,False,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,False,False,en,,,,,
4,Sat Jul 29 16:00:24 +0000 2017,891327558926688256,891327558926688256,This is Franklin. He would like you to stop ca...,False,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,False,False,en,,,,,


<span style="color: red;">Since we only need id, retweet count, and favorite count, let's project the rest of the columns out:</span>

In [27]:
tweet_json_txt_df = tweet_json_txt_df[["id", "retweet_count", "favorite_count"]].rename({"id": "tweet_id"}, axis=1)
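
<span style="color: red;">The API dataframe has 2327 rows while the archive has 2356 unique ids, so some ids were apparently not returned. A sketch of how the missing ids could be listed, using toy frames with made-up ids in place of the real dataframes:</span>

```python
import pandas as pd

# Toy stand-ins for twitter_archived_enhanced_df and tweet_json_txt_df.
archive = pd.DataFrame({"tweet_id": [101, 102, 103, 104]})
api_rows = pd.DataFrame(
    {"tweet_id": [101, 103], "retweet_count": [5, 9], "favorite_count": [20, 31]}
)

# Archive ids with no matching row in the API data.
missing_mask = ~archive["tweet_id"].isin(api_rows["tweet_id"])
missing_ids = archive.loc[missing_mask, "tweet_id"].tolist()
print(missing_ids)  # [102, 104]
```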

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



### twitter_archived_enhanced_df

In [28]:
twitter_archived_enhanced_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [29]:
twitter_archived_enhanced_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

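<span style="color: red;">According to ``info()``, there are no null values in the text, rating, name, or stage columns. As the value counts further down show, missing names and stages are stored as the string "None" instead.</span>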

In [30]:
twitter_archived_enhanced_df.tweet_id.duplicated().sum()  # No duplicate tweets

0

In [31]:
(~twitter_archived_enhanced_df.in_reply_to_status_id.isna()).sum()  # There are 78 replies

78

In [33]:
(~twitter_archived_enhanced_df.retweeted_status_id.isna()).sum()  # There are 181 retweets

181

<span style="color: red;">Let's filter out replies and retweets (only for assessing purposes):</span>

In [34]:
twitter_archived_enhanced_valid_df = twitter_archived_enhanced_df[
    twitter_archived_enhanced_df.in_reply_to_status_id.isna()
    & twitter_archived_enhanced_df.retweeted_status_id.isna()
]

In [35]:
twitter_archived_enhanced_valid_df.shape[0]  # We've got 2097 potentially valid tweets

2097

<span style="color: red;">Let's look for rating numerators/denominators outside of the traditional WeRateDogs rating system. Historically, ratings range from 0/10 to 15/10. Let's look at numerators:</span>

In [36]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator > 15
].rating_numerator.value_counts()

84      1
24      1
88      1
144     1
26      1
121     1
44      1
60      1
45      1
80      1
99      1
50      1
204     1
1776    1
165     1
27      1
75      1
420     1
Name: rating_numerator, dtype: int64

In [37]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 84
][["tweet_id", "text"]].values

array([[820690176645140481,
        'The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd']],
      dtype=object)

In [38]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 24
][["tweet_id", "text"]].values

array([[810984652412424192,
        'Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx']],
      dtype=object)

In [39]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 88
][["tweet_id", "text"]].values

array([[675853064436391936,
        'Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw']],
      dtype=object)

In [40]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 26
][["tweet_id", "text"]].values

array([[680494726643068929,
        'Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD']],
      dtype=object)

In [41]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 121
][["tweet_id", "text"]].values

array([[684222868335505415,
        'Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55']],
      dtype=object)

In [42]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 44
][["tweet_id", "text"]].values

array([[697463031882764288,
        "Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ"]],
      dtype=object)

In [43]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 60
][["tweet_id", "text"]].values

array([[704054845121142784,
        "Here is a whole flock of puppers.  60/50 I'll take the lot https://t.co/9dpcw6MdWa"]],
      dtype=object)

In [44]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 45
][["tweet_id", "text"]].values

array([[709198395643068416,
        'From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK']],
      dtype=object)

In [45]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 80
][["tweet_id", "text"]].values

array([[710658690886586372,
        "Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12"]],
      dtype=object)

In [46]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 99
][["tweet_id", "text"]].values

array([[713900603437621249,
        "Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1"]],
      dtype=object)

In [47]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 50
][["tweet_id", "text"]].values

array([[716439118184652801,
        'This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq']],
      dtype=object)

In [48]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 204
][["tweet_id", "text"]].values

array([[731156023742988288,
        'Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv']],
      dtype=object)

<span style="color: red;">A joke rating but technically still valid:</span>

In [49]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 1776
][["tweet_id", "text"]].values

array([[749981277374128128,
        "This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh"]],
      dtype=object)

In [50]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 165
][["tweet_id", "text"]].values

array([[758467244762497024,
        'Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE']],
      dtype=object)

In [51]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 27
][["tweet_id", "text"]].values

array([[778027034220126208,
        "This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq"]],
      dtype=object)

In [52]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 75
][["tweet_id", "text"]].values

array([[786709082849828864,
        "This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS"]],
      dtype=object)

In [53]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_numerator == 420
][["tweet_id", "text"]].values

array([[670842764863651840,
        'After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY']],
      dtype=object)
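
<span style="color: red;">Numerators such as 26, 27 and 75 come from decimal ratings (11.26/10, 11.27/10, 9.75/10) whose integer part was lost. The sketch below assumes the archive was parsed with an integer-only pattern, which is a guess rather than something the data confirms, and compares it with a pattern that also accepts a decimal numerator:</span>

```python
import re

INTEGER_RATING = re.compile(r"(\d+)/(\d+)")            # assumed original behaviour
DECIMAL_RATING = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")  # also accepts a decimal numerator

text = (
    "This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. "
    "H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS"
)

integer_match = INTEGER_RATING.search(text).groups()
decimal_match = DECIMAL_RATING.search(text).groups()

print(integer_match)  # ('75', '10')
print(decimal_match)  # ('9.75', '10')
```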

<span style="color: red;">Now let's check the denominators. Anything that is not a 10 and does not contain multiple doggos is to be considered suspicious.</span>

In [54]:
twitter_archived_enhanced_valid_df[
    (twitter_archived_enhanced_valid_df.rating_denominator != 10) &
    (~twitter_archived_enhanced_valid_df.tweet_id.isin([
        820690176645140481, 675853064436391936, 677716515794329600, 684222868335505415, 697463031882764288,
        704054845121142784, 709198395643068416, 710658690886586372, 713900603437621249, 731156023742988288,
        758467244762497024
    ]))
].rating_denominator.value_counts()

11    2
7     1
20    1
50    1
2     1
Name: rating_denominator, dtype: int64

In [55]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_denominator == 11
][["tweet_id", "text"]].values

array([[740373189193256964,
        'After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ'],
       [682962037429899265,
        'This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5']],
      dtype=object)

In [56]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_denominator == 7
][["tweet_id", "text"]].values

array([[810984652412424192,
        'Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx']],
      dtype=object)

In [57]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_denominator == 20
][["tweet_id", "text"]].values

array([[722974582966214656,
        'Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a']],
      dtype=object)

In [58]:
twitter_archived_enhanced_valid_df[
    twitter_archived_enhanced_valid_df.rating_denominator == 2
][["tweet_id", "text"]].values

array([[666287406224695296,
        'This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv']],
      dtype=object)
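
<span style="color: red;">In the tweets above the first fraction in the text (9/11, 7/11, 4/20, 1/2) is not the rating, and one tweet (24/7) has no rating at all. A heuristic sketch follows. ``extract_rating`` is a hypothetical helper, and the rule of keeping the last fraction whose denominator is a multiple of 10 is an assumption based on the examples shown here, not a verified rule for the whole dataset:</span>

```python
import re

# Fractions such as 13/10, 9.75/10 or 84/70; the numerator may be decimal.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) of the last plausible rating, or None."""
    candidates = [
        (float(num), int(den))
        for num, den in RATING_RE.findall(text)
        if int(den) > 0 and int(den) % 10 == 0  # assumed: denominators are multiples of 10
    ]
    return candidates[-1] if candidates else None


bretagne = extract_rating(
    "After so many requests, this is Bretagne. She was the last surviving 9/11 "
    "search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ"
)
squad = extract_rating("Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a")
sam = extract_rating("Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer.")
```

<span style="color: red;">With this rule the first two tweets yield 14/10 and 13/10, and the last one yields no rating.</span>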

<span style="color: red;">Now let's check the names:</span>

In [59]:
twitter_archived_enhanced_valid_df.name.value_counts()

None          603
a              55
Lucy           11
Charlie        11
Oliver         10
             ... 
Lenox           1
Harvey          1
Blanket         1
Burt            1
Christoper      1
Name: name, Length: 955, dtype: int64

In [60]:
twitter_archived_enhanced_valid_df[twitter_archived_enhanced_valid_df.name == "a"][["tweet_id", "text"]].values

array([[881536004380872706,
        'Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af \n(IG: puffie_the_chow) https://t.co/ghXBIIeQZF'],
       [792913359805018113,
        'Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq'],
       [772581559778025472,
        'Guys this is getting so out of hand. We only rate dogs. This is a Galapagos Speed Panda. Pls only send dogs... 10/10 https://t.co/8lpAGaZRFn'],
       [747885874273214464,
        'This is a mighty rare blue-tailed hammer sherk. Human almost lost a limb trying to take these. Be careful guys. 8/10 https://t.co/TGenMeXreW'],
       [747816857231626240,
        'Viewer discretion is advised. This is a terrible attack in progress. Not even in water (tragic af). 4/10 bad sherk https://t.co/L3U0j14N5R'],
       [746872823977771008,
        'This is a carrot. We only rate dogs. Please only send in 
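
<span style="color: red;">The name "a" is clearly an extraction error, and "None" is a placeholder rather than a name. A small sketch of a check that flags both. ``looks_like_name`` is a hypothetical helper, and treating every name that does not start with an uppercase letter as suspicious is an assumption:</span>

```python
def looks_like_name(value):
    """Heuristic: a real name is not the "None" placeholder and starts with an uppercase letter."""
    return value != "None" and value[:1].isupper()


# A few values taken from the outputs above.
sample_names = ["Phineas", "a", "None", "Tilly"]
suspicious = [name for name in sample_names if not looks_like_name(name)]
print(suspicious)  # ['a', 'None']
```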

<span style="color: red;">Now let's check the stages:</span>

In [61]:
twitter_archived_enhanced_valid_df.doggo.value_counts()

None     2014
doggo      83
Name: doggo, dtype: int64

In [62]:
twitter_archived_enhanced_valid_df.floofer.value_counts()

None       2087
floofer      10
Name: floofer, dtype: int64

In [63]:
twitter_archived_enhanced_valid_df.pupper.value_counts()

None      1867
pupper     230
Name: pupper, dtype: int64

In [64]:
twitter_archived_enhanced_valid_df.puppo.value_counts()

None     2073
puppo      24
Name: puppo, dtype: int64
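
<span style="color: red;">The counts above do not show whether any tweet carries more than one stage. A sketch of how that could be checked, on toy rows that use the same "None" string placeholder:</span>

```python
import pandas as pd

stage_columns = ["doggo", "floofer", "pupper", "puppo"]

# Toy rows; the real check would run on twitter_archived_enhanced_valid_df.
toy = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "doggo": ["doggo", "None", "doggo"],
        "floofer": ["None", "None", "None"],
        "pupper": ["None", "None", "pupper"],
        "puppo": ["None", "None", "None"],
    }
)

# Number of stage columns that are set for each tweet.
stages_per_tweet = (toy[stage_columns] != "None").sum(axis=1)
multi_stage_ids = toy.loc[stages_per_tweet > 1, "tweet_id"].tolist()
```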

### image_predictions_df

In [65]:
image_predictions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [66]:
image_predictions_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [67]:
image_predictions_df.shape[0]  # The record count differs from twitter_archived_enhanced_df

2075

<span style="color: red;">How many dog predictions are there? Are there any rows without a dog prediction?</span>

In [68]:
(image_predictions_df.p1_dog | image_predictions_df.p2_dog | image_predictions_df.p3_dog).sum()

1751

In [69]:
((~image_predictions_df.p1_dog) & (~image_predictions_df.p2_dog) & (~image_predictions_df.p3_dog)).sum()

324

In [70]:
image_predictions_df[(~image_predictions_df.p1_dog) & (~image_predictions_df.p2_dog) & (~image_predictions_df.p3_dog)].head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
17,666104133288665088,https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg,1,hen,0.965932,False,cock,0.033919,False,partridge,5.2e-05,False
18,666268910803644416,https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg,1,desktop_computer,0.086502,False,desk,0.085547,False,bookcase,0.07948,False
21,666293911632134144,https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg,1,three-toed_sloth,0.914671,False,otter,0.01525,False,great_grey_owl,0.013207,False
25,666362758909284353,https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg,1,guinea_pig,0.996496,False,skunk,0.002402,False,hamster,0.000461,False
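
For rows that do have a dog prediction, one possible way to reduce the three prediction triples to a single breed is to take the highest-ranked prediction flagged as a dog. The sketch below assumes that rule, which the notebook does not state. `toy_predictions`, `breed` and `breed_conf` are hypothetical names, and the third row is invented.

```python
import pandas as pd

# Hypothetical rows shaped like image_predictions_df (third row is made up).
toy_predictions = pd.DataFrame({
    "p1": ["Welsh_springer_spaniel", "box_turtle", "desktop_computer"],
    "p1_conf": [0.465074, 0.933012, 0.40],
    "p1_dog": [True, False, False],
    "p2": ["collie", "mud_turtle", "pug"],
    "p2_conf": [0.156665, 0.045885, 0.30],
    "p2_dog": [True, False, True],
    "p3": ["Shetland_sheepdog", "terrapin", "desk"],
    "p3_conf": [0.061428, 0.017885, 0.10],
    "p3_dog": [True, False, False],
})

tp = toy_predictions

# Series.where keeps a value where the condition is True and falls back to
# `other` elsewhere, so nesting the calls picks the first prediction that is a dog.
tp["breed"] = tp.p1.where(tp.p1_dog, tp.p2.where(tp.p2_dog, tp.p3.where(tp.p3_dog)))
tp["breed_conf"] = tp.p1_conf.where(
    tp.p1_dog, tp.p2_conf.where(tp.p2_dog, tp.p3_conf.where(tp.p3_dog))
)
```

Rows where none of the three predictions is a dog end up with a missing `breed`, which makes them easy to drop or flag later.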


<span style="color: red;">Are the confidence values consistent?</span>

In [71]:
image_predictions_df.p1_conf.describe()

count    2075.000000
mean        0.594548
std         0.271174
min         0.044333
25%         0.364412
50%         0.588230
75%         0.843855
max         1.000000
Name: p1_conf, dtype: float64

In [72]:
image_predictions_df.p2_conf.describe()

count    2.075000e+03
mean     1.345886e-01
std      1.006657e-01
min      1.011300e-08
25%      5.388625e-02
50%      1.181810e-01
75%      1.955655e-01
max      4.880140e-01
Name: p2_conf, dtype: float64

In [73]:
image_predictions_df.p3_conf.describe()

count    2.075000e+03
mean     6.032417e-02
std      5.090593e-02
min      1.740170e-10
25%      1.622240e-02
50%      4.944380e-02
75%      9.180755e-02
max      2.734190e-01
Name: p3_conf, dtype: float64
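
The summaries above suggest the confidences are ranked (`p1_conf` is the largest), but `describe()` does not prove it row by row. A minimal sketch of an explicit check follows. It assumes the three values are probabilities from one classifier, so each should lie in [0, 1], they should be in descending order, and together they should not exceed 1. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows shaped like image_predictions_df.
toy_predictions = pd.DataFrame({
    "p1_conf": [0.465074, 0.506826, 0.30],
    "p2_conf": [0.156665, 0.074192, 0.40],  # third row is deliberately out of order
    "p3_conf": [0.061428, 0.072010, 0.10],
})

conf = toy_predictions[["p1_conf", "p2_conf", "p3_conf"]]

# Every confidence should be a valid probability.
in_range = conf.ge(0).all(axis=1) & conf.le(1).all(axis=1)

# Predictions should be ranked: p1 >= p2 >= p3.
ordered = (conf.p1_conf >= conf.p2_conf) & (conf.p2_conf >= conf.p3_conf)

# Assumption: the three confidences together cannot exceed 1 (small rounding tolerance).
sums_ok = conf.sum(axis=1) <= 1 + 1e-6

inconsistent = toy_predictions[~(in_range & ordered & sums_ok)]
```

An empty `inconsistent` frame on the real data would confirm the values are consistent under these assumptions.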

<span style="color: red;">Is there anything unusual in the dog breeds?</span>

In [74]:
image_predictions_df[image_predictions_df.p1_dog].p1.value_counts()

golden_retriever      150
Labrador_retriever    100
Pembroke               89
Chihuahua              83
pug                    57
                     ... 
Japanese_spaniel        1
Scotch_terrier          1
standard_schnauzer      1
EntleBucher             1
clumber                 1
Name: p1, Length: 111, dtype: int64

In [75]:
image_predictions_df[image_predictions_df.p2_dog].p2.value_counts()

Labrador_retriever      104
golden_retriever         92
Cardigan                 73
Chihuahua                44
Pomeranian               42
                       ... 
affenpinscher             1
Japanese_spaniel          1
Kerry_blue_terrier        1
komondor                  1
Bernese_mountain_dog      1
Name: p2, Length: 113, dtype: int64

In [76]:
image_predictions_df[image_predictions_df.p3_dog].p3.value_counts()

Labrador_retriever    79
Chihuahua             58
golden_retriever      48
Eskimo_dog            38
kelpie                35
                      ..
Irish_wolfhound        2
affenpinscher          1
Kerry_blue_terrier     1
standard_schnauzer     1
Sussex_spaniel         1
Name: p3, Length: 116, dtype: int64
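
The outputs above mix capitalization styles (`golden_retriever`, `Labrador_retriever`, `EntleBucher`) and use underscores as separators. One way to make the labels uniform is sketched below. Lowercasing everything is an assumption about the desired style, not something the notebook prescribes.

```python
import pandas as pd

# Sample of labels as they appear in the p1/p2/p3 columns.
breeds = pd.Series(["golden_retriever", "Labrador_retriever", "EntleBucher", "pug"])

# Replace underscores with spaces and lowercase everything for a uniform style.
normalized = breeds.str.replace("_", " ", regex=False).str.lower()
```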

### Quality issues

1. There are 78 replies and 181 retweets in ``twitter_archived_enhanced_df``.

2. On tweets ``820690176645140481``, ``675853064436391936``, ``677716515794329600``, ``684222868335505415``, ``697463031882764288``, ``704054845121142784``, ``709198395643068416``, ``710658690886586372``, ``713900603437621249``, ``731156023742988288``, ``758467244762497024`` there are multiple doggos with unusual ratings.

3. On tweets ``810984652412424192``, ``716439118184652801``, ``740373189193256964``, ``682962037429899265``, ``722974582966214656``, ``666287406224695296`` there are multiple expressions confused for ratings.

4. On tweets ``680494726643068929``, ``778027034220126208``, ``786709082849828864`` there are fractional ratings.

5. On tweet ``670842764863651840`` the rating is invalid as the subject is not a doggo.

6. Although the rating on tweet ``749981277374128128`` is technically valid (1776), we consider it an outlier.

7. Invalid name "a" on 55 tweets.

8. There were missing tweets when fetching the retweet and favourite counts with Twitter's API.

9. There are 324 records without a dog prediction on ``image_predictions_df``.
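
Issues 1 and 7 lend themselves to simple filters. The sketch below shows one possible approach on a hypothetical frame (`toy_archive`) that reuses column names from `twitter-archive-enhanced.csv`. Treating the name "a" as missing is an assumption about how to handle it.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using column names from twitter-archive-enhanced.csv.
toy_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_status_id": [np.nan, 10.0, np.nan, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 20.0, np.nan],
    "name": ["Phineas", "Bella", "a", "a"],
})

# Keep only original tweets: no reply target and no retweeted status.
originals = toy_archive[
    toy_archive["in_reply_to_status_id"].isna()
    & toy_archive["retweeted_status_id"].isna()
].copy()

# Series.mask replaces values with NaN where the condition is True,
# so the invalid name "a" becomes a missing value.
originals["name"] = originals["name"].mask(originals["name"] == "a")
```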

### Tidiness issues

1. Multiple tables to represent the same observational unit: the tweet.

2. Multiple columns for the doggo stage variable.

3. Multiple columns to represent the breed prediction and confidence.

4. ``timestamp`` is stored as a string but should be a datetime.

5. Dog breed names could be made consistent and readable.
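
Tidiness issues 2 and 4 could be addressed roughly as sketched below. The frame is hypothetical (only the first timestamp comes from the archive), the `stage` column name is an assumption, and joining multiple stages with a comma is just one option.

```python
import pandas as pd

# Hypothetical rows; stage columns use the "None" string placeholder.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 09:00:00 +0000",
        "2017-07-30 12:30:00 +0000",
    ],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})

stage_columns = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [value for value in row if value != "None"]
    return ",".join(found) if found else None


# Collapse the four stage columns into a single variable.
toy["stage"] = toy[stage_columns].apply(combine_stages, axis=1)
toy = toy.drop(columns=stage_columns)

# Parse the timestamp strings, including the UTC offset, into datetimes.
toy["timestamp"] = pd.to_datetime(toy["timestamp"], format="%Y-%m-%d %H:%M:%S %z")
```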

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
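
As a rough illustration of the note above, the sketch below copies three small stand-in frames and merges them on `tweet_id`. All frame names, the `retweet_count` and `favorite_count` columns, and the choice of an inner join are assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-ins for the three gathered tables.
archive_df = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})
predictions_df = pd.DataFrame({"tweet_id": [1, 3], "p1": ["pug", "collie"]})
counts_df = pd.DataFrame(
    {"tweet_id": [1, 2], "retweet_count": [100, 50], "favorite_count": [500, 200]}
)

# Work on copies so the originals stay available for comparison.
archive_clean = archive_df.copy()
predictions_clean = predictions_df.copy()
counts_clean = counts_df.copy()

# An inner join keeps only tweets present in all three tables.
master_df = archive_clean.merge(predictions_clean, on="tweet_id", how="inner").merge(
    counts_clean, on="tweet_id", how="inner"
)
```

A left join from the archive would keep tweets that lack predictions or counts, at the cost of missing values in those columns.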

In [None]:
# Make copies of original pieces of data


### Issue #1:

#### Define

#### Code

#### Test

### Issue #2:

#### Define

#### Code

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
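
A minimal sketch of the save step, assuming the cleaned result lives in a frame called `master_df` (a hypothetical name). It writes to a temporary directory so the example is self-contained; in this notebook the target would more likely sit under `data_dir`.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

import pandas as pd

# Hypothetical one-row master table.
master_df = pd.DataFrame({"tweet_id": [892420643555336193], "rating_numerator": [13]})

with TemporaryDirectory() as tmp:
    target = Path(tmp) / "twitter_archive_master.csv"
    # index=False avoids writing the RangeIndex as an extra unnamed column.
    master_df.to_csv(target, index=False)
    # Read the file back to confirm it round-trips.
    restored = pd.read_csv(target)
```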

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization