# Project: Wrangle and Analyze Data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gathering">Data Gathering</a></li>
<li><a href="#assessing">Assessing Data</a></li>
<li><a href="#cleaning">Cleaning Data</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Descriptions

#### WeRateDogs Twitter Archive Data


### Import necessary libraries

In [44]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json
import tweepy
from tweepy import OAuthHandler
from timeit import default_timer as timer
import os
import requests
import lxml
from dotenv import load_dotenv

%matplotlib inline

<a id='gathering'></a>
## Data Gathering

1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)
2. Use the Requests library to download the tweet image predictions (image-predictions.tsv)
3. Use the Tweepy library to query additional data via the Twitter API (tweet-json.txt)

### Read in the WeRateDogs Twitter archive data downloaded directly from Udacity

In [4]:
df_archive = pd.read_csv('./data/twitter-archive-enhanced.csv')

#### Creating functions to be used for creating folders, files, and downloading data using the Requests library

In [5]:
def get_data_content(url):
    """
    Return the content of the response from the provided url

    Requires that `requests` be installed within the Python
    environment you are running this in.

    Parameter
    -----------
    url : str

    Returns
    ----------
    bytes
        Response from url as bytes
    """
    
    response = requests.get(url)
    response.raise_for_status() # Stop on HTTP errors so an error page is never saved as data
    return response.content

def create_folder(folder_name):
    """
    Creates a folder at the given path if it does not already exist

    Uses `os` from the Python standard library.

    Parameter
    -----------
    folder_name : str
    """
    
    # Checks directory for folder, folder_name
    if not os.path.exists(folder_name):
        os.makedirs(folder_name) # If folder_name was not found, makes new folder, folder_name

def create_file(url, folder_name):
    """
    Creates a file with the contents from `get_data_content` in folder if the file 
    does not already exist there

    Uses `os` from the Python standard library.

    Parameters
    -----------
    url : str
    folder_name : str
    """
    
    file_name = url.split('/')[-1] # Keeps the text to the right of the last / in the url
    file_path = os.path.join(folder_name, file_name)
    # Only download when the file is not already inside folder_name.
    # The check must happen before open(), because mode 'wb' empties an existing file
    if not os.path.exists(file_path):
        # response.content is stored as bytes so mode argument is set to 'wb' for write binary
        with open(file_path, mode='wb') as file:
            file.write(get_data_content(url))

def create_data_file(url, folder_name):
    """
    Create file with contents from the response of a URL inside a folder

    Parameters
    -----------
    url : str
    folder_name : str    
    """
    
    create_folder(folder_name)
    create_file(url, folder_name)
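
The download-if-missing idea behind these helpers can be sketched without network access, as below. The existence check runs on the full destination path before the file is opened, since opening with mode `'wb'` would empty an existing file. The name `save_if_missing` and its `fetch` parameter are assumptions introduced for illustration and are not part of the notebook; in the notebook's setting `fetch` would wrap `requests.get`.

```python
import os
import tempfile


def save_if_missing(url, folder_name, fetch):
    """Write fetch(url) to folder_name/<last URL segment> unless it already exists.

    `fetch` is a hypothetical callable that takes a URL and returns bytes.
    Returns the destination path.
    """
    os.makedirs(folder_name, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder_name, url.split('/')[-1])
    if not os.path.exists(path):  # check before opening so nothing is truncated
        with open(path, mode='wb') as file:
            file.write(fetch(url))
    return path


# Usage with a stand-in fetcher that records how often it is called
calls = []


def fake_fetch(url):
    calls.append(url)
    return b'tweet_id\tjpg_url\n'


with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, 'data')
    first = save_if_missing('https://example.com/files/sample.tsv', folder, fake_fetch)
    second = save_if_missing('https://example.com/files/sample.tsv', folder, fake_fetch)
    with open(first, 'rb') as file:
        saved = file.read()
```

The second call finds the file already in place and skips the download, so re-running the notebook does not fetch the data again.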

### Use the Requests library to download the tweet image prediction data

In [6]:
create_data_file('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv',
                 'data')

In [7]:
df_predictions = pd.read_csv('./data/image-predictions.tsv', delimiter='\t')

### Use the Tweepy library to query additional data via the Twitter API

Below is the code Udacity provides to students who are unable to use the API (which is now all students, since Twitter changed its API access).

**Please note:**
> Twitter no longer allows v1 API access. The free plan for v2 does not include tweet lookup. The basic plan for v2 can retrieve up to 10,000 tweets per month at a rate limit of 15 requests per 15 minutes, but it costs $100/month.
> 
> I had already created a dotenv file to store my API credentials and keep them hidden.
> 
> The cell below is in Raw format to keep it from running, since the code assumes v1.1 of Twitter's API and I was only able to get v2 credentials.

Since I am unable to gather the data I need through the Twitter API, I will use the output of the code above as provided by Udacity.

> I will download it the same way I downloaded the image-predictions.tsv data.

In [8]:
create_data_file('https://video.udacity-data.com/topher/2018/November/5be5fb7d_tweet-json/tweet-json.txt',
                 'data')

In [9]:
df_list = [] # Creates empty list to build DataFrame later
with open('./data/tweet-json.txt', encoding='utf-8') as file:
    for line in file:
        tweet = json.loads(line) # Parse the line as JSON so fields can be read by key instead of slicing the string
        tweet_id = tweet['id_str'] # Get tweet id as string
        retweet_count = tweet['retweet_count']
        favorite_count = tweet['favorite_count']
        # Add dict to list to use for creating a DataFrame
        df_list.append({'tweet_id': tweet_id,
                        'retweet_count': retweet_count,
                        'favorite_count': favorite_count})

# Create DataFrame using list created from file
df_json = pd.DataFrame(df_list, columns = ['tweet_id', 'retweet_count', 'favorite_count'])
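
The line-by-line parsing above can be tried on a small in-memory sample, as sketched here. The two records are made up and carry only the three fields the notebook extracts; the names `sample`, `wanted`, and `rows` are illustrative.

```python
import io
import json

# Two made-up records in the same one-JSON-object-per-line layout
sample = io.StringIO(
    '{"id_str": "1001", "retweet_count": 5, "favorite_count": 12}\n'
    '{"id_str": "1002", "retweet_count": 0, "favorite_count": 3}\n'
)

wanted = ('id_str', 'retweet_count', 'favorite_count')
rows = []
for line in sample:
    line = line.strip()
    if not line:
        continue  # skip blank lines instead of failing in json.loads
    tweet = json.loads(line)
    rows.append({key: tweet[key] for key in wanted})
```

Reading `id_str` rather than a numeric ID keeps the identifier as text, so it cannot be changed by a later numeric conversion.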

<a id='assessing'></a>
## Assessing Data

<ul>
<li><a href="#archive">Assessing df_archive</a></li>
<li><a href="#predictions">Assessing df_predictions</a></li>
<li><a href="#json">Assessing df_json</a></li>
<li><a href="#quality">Quality Issues</a></li>
<li><a href="#tidiness">Tidiness Issues</a></li>
</ul>
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
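
The first point above (original ratings only) could be approached as in the sketch below. The rows are made up and only the column names follow `df_archive`. Dropping replies as well as retweets is an assumption on my part, and the "that have images" condition is not covered here.

```python
import pandas as pd

# Made-up rows; column names follow df_archive
tweets = pd.DataFrame({
    'tweet_id': ['1', '2', '3', '4'],
    'retweeted_status_id': [None, '9', None, None],
    'in_reply_to_status_id': [None, None, '8', None],
})

# A retweet or reply is recognizable by a non-missing ID in the matching column
is_retweet = tweets['retweeted_status_id'].notna()
is_reply = tweets['in_reply_to_status_id'].notna()

originals = tweets[~is_retweet & ~is_reply]
original_ids = originals['tweet_id'].tolist()
```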





In [10]:
def check_for_dupes(data, col):
    """
    Checks one column of a dataset for duplicate values (dupe or dupes)
    Returns a string with the column name, the number of duplicated values, and the frequency of each value (if applicable)
    Note: `duplicated()` treats repeated NaN values as duplicates, so mostly empty columns report high counts

    Parameters
    -----------
    data : DataFrame
    col : str
        Name of the column in `data` to check
    """
    
    dupe_count = data[col].duplicated().sum()
    if dupe_count == 0:
        return f'{col} has {dupe_count} dupes\n'
    elif dupe_count == 1:
        return f'{col} has {dupe_count} dupe:\n{data[col].value_counts()}\n'
    else:
        return f'{col} has {dupe_count} dupes:\n{data[col].value_counts()}\n'
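
One thing to keep in mind when reading the counts from this helper: `Series.duplicated()` treats repeated missing values as duplicates. The sketch below, on a made-up Series, shows the difference between counting with and without missing values.

```python
import numpy as np
import pandas as pd

ids = pd.Series([1.0, np.nan, np.nan, 1.0, 2.0])

# Every repeat is counted, including the second NaN
with_missing = int(ids.duplicated().sum())

# Dropping missing values first counts only repeated real values
without_missing = int(ids.dropna().duplicated().sum())
```

This is why a mostly empty column such as `in_reply_to_status_id` can report thousands of dupes even though only 78 values are present.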

<a id='archive'></a>
### Assessing `df_archive`

#### Visual assessment

In [11]:
df_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


##### Visual Assessment notes:

* Some names are missing
* Some names are 'a'
* Many rows have no classification in any of doggo, floofer, pupper, or puppo

#### Programmatic Assessment

In [12]:
df_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        1611 non-null   object 
 13  doggo                       97 no

##### .info() Notes:

* 78 rows are replies
* 181 rows are retweets (which can be considered duplicated data, since a retweet is a repost of an original tweet)
* 59 rows are missing values in the expanded_urls column
* name is missing a value for 745 rows
* doggo, floofer, pupper, and puppo are empty for the majority of the records
* in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, and retweeted_status_user_id are float and should be string
* timestamp and retweeted_status_timestamp are object and should be datetime
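
The dtype notes above could be addressed at load time, as in this sketch. A float64 cannot represent every 18-digit ID exactly, so reading ID columns as strings is safer than converting them afterwards. The CSV text is illustrative, and `id_columns` and `sample` are names introduced only for this example.

```python
import io
import pandas as pd

# Illustrative rows in the layout of the archive file
csv_text = (
    'tweet_id,in_reply_to_status_id,timestamp\n'
    '892420643555336193,,2017-08-01 16:23:56 +0000\n'
    '886267009285017600,886266357075128320,2017-07-15 16:51:35 +0000\n'
)

# Read ID columns as text so no digits are lost to float conversion
id_columns = ['tweet_id', 'in_reply_to_status_id']
sample = pd.read_csv(io.StringIO(csv_text), dtype={col: str for col in id_columns})

# Parse the timestamp strings into timezone-aware datetimes
sample['timestamp'] = pd.to_datetime(sample['timestamp'], utc=True)

reply_id = sample.loc[1, 'in_reply_to_status_id']
```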

In [13]:
df_archive.duplicated().any()

False

In [14]:
# Print out the duplicate values for each column in the data set
for col in df_archive.columns:
    print(check_for_dupes(df_archive, col), '\n', '--------------------------------------------', '\n')

tweet_id has 0 dupes
 
 -------------------------------------------- 

in_reply_to_status_id has 2278 dupes:
in_reply_to_status_id
6.671522e+17    2
8.862664e+17    1
6.920419e+17    1
6.827884e+17    1
6.842229e+17    1
               ..
8.116272e+17    1
8.131273e+17    1
8.211526e+17    1
8.233264e+17    1
6.670655e+17    1
Name: count, Length: 77, dtype: int64
 
 -------------------------------------------- 

in_reply_to_user_id has 2324 dupes:
in_reply_to_user_id
4.196984e+09    47
2.195506e+07     2
2.281182e+09     1
1.132119e+08     1
1.637468e+07     1
4.670367e+08     1
1.198989e+09     1
2.878549e+07     1
2.319108e+09     1
3.589728e+08     1
4.717297e+09     1
1.584641e+07     1
7.305050e+17     1
2.916630e+07     1
2.918590e+08     1
1.185634e+07     1
2.068372e+07     1
1.582854e+09     1
4.738443e+07     1
3.058208e+07     1
2.625958e+07     1
2.894131e+09     1
8.405479e+17     1
1.361572e+07     1
1.943518e+08     1
2.792810e+08     1
1.806710e+08     1
7.759620e+07  

In [15]:
df_archive.query('expanded_urls == "https://twitter.com/dog_rates/status/667152164079423490/photo/1"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
273,840728873075638272,,,2017-03-12 00:59:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Pipsy. He is a fluffbal...,6.671522e+17,4196984000.0,2015-11-19 01:27:25 +0000,https://twitter.com/dog_rates/status/667152164...,12,10,Pipsy,,,,
2293,667152164079423490,,,2015-11-19 01:27:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Pipsy. He is a fluffball. Enjoys trave...,,,,https://twitter.com/dog_rates/status/667152164...,12,10,Pipsy,,,,


In [16]:
df_archive['tweet_id'].duplicated().any()

False

In [17]:
df_archive['doggo'].unique()

array([nan, 'doggo'], dtype=object)

In [18]:
df_archive['floofer'].unique()

array([nan, 'floofer'], dtype=object)

In [19]:
df_archive['pupper'].unique()

array([nan, 'pupper'], dtype=object)

In [20]:
df_archive['puppo'].unique()

array([nan, 'puppo'], dtype=object)

In [21]:
df_archive['name'].value_counts()

name
a             55
Charlie       12
Oliver        11
Cooper        11
Lucy          11
              ..
Aqua           1
Chase          1
Meatball       1
Rorie          1
Christoper     1
Name: count, Length: 956, dtype: int64

In [22]:
df_archive['rating_denominator'].sort_values().unique()

array([  0,   2,   7,  10,  11,  15,  16,  20,  40,  50,  70,  80,  90,
       110, 120, 130, 150, 170], dtype=int64)

In [23]:
df_archive['rating_denominator'].value_counts()

rating_denominator
10     2333
11        3
50        3
20        2
80        2
70        1
7         1
15        1
150       1
170       1
0         1
90        1
40        1
130       1
110       1
16        1
120       1
2         1
Name: count, dtype: int64

In [24]:
df_archive['rating_numerator'].sort_values().unique()

array([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,   10,
         11,   12,   13,   14,   15,   17,   20,   24,   26,   27,   44,
         45,   50,   60,   75,   80,   84,   88,   99,  121,  143,  144,
        165,  182,  204,  420,  666,  960, 1776], dtype=int64)

In [25]:
df_archive['rating_numerator'].value_counts()

rating_numerator
12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
2         9
1         9
75        2
15        2
420       2
0         2
80        1
144       1
17        1
26        1
20        1
121       1
143       1
44        1
60        1
45        1
50        1
99        1
204       1
1776      1
165       1
666       1
27        1
182       1
24        1
960       1
84        1
88        1
Name: count, dtype: int64

<a id='predictions'></a>
### Assessing `df_predictions`

#### Visual assessment

In [26]:
df_predictions

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


#### Programmatic Assessment

In [27]:
df_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [28]:
df_predictions.duplicated().any()

False

In [29]:
# Print out the duplicate values for each column in the data set
for col in df_predictions.columns:
    print(check_for_dupes(df_predictions, col), '\n', '--------------------------------------------', '\n')

tweet_id has 0 dupes
 
 -------------------------------------------- 

jpg_url has 66 dupes:
jpg_url
https://pbs.twimg.com/media/CZhn-QAWwAASQan.jpg                                            2
https://pbs.twimg.com/media/Cq9guJ5WgAADfpF.jpg                                            2
https://pbs.twimg.com/ext_tw_video_thumb/807106774843039744/pu/img/8XZg1xW35Xp2J6JW.jpg    2
https://pbs.twimg.com/media/CU1zsMSUAAAS0qW.jpg                                            2
https://pbs.twimg.com/media/CsrjryzWgAAZY00.jpg                                            2
                                                                                          ..
https://pbs.twimg.com/media/CXrmMSpUwAAdeRj.jpg                                            1
https://pbs.twimg.com/media/CXrawAhWkAAWSxC.jpg                                            1
https://pbs.twimg.com/media/CXrIntsUsAEkv0d.jpg                                            1
https://pbs.twimg.com/media/CXqcOHCUQAAugTB.jpg               

In [30]:
df_predictions.query('jpg_url == "https://pbs.twimg.com/media/CZhn-QAWwAASQan.jpg"')

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
800,691416866452082688,https://pbs.twimg.com/media/CZhn-QAWwAASQan.jpg,1,Lakeland_terrier,0.530104,True,Irish_terrier,0.197314,True,Airedale,0.082515,True
1624,803692223237865472,https://pbs.twimg.com/media/CZhn-QAWwAASQan.jpg,1,Lakeland_terrier,0.530104,True,Irish_terrier,0.197314,True,Airedale,0.082515,True


In [31]:
df_archive.query('tweet_id == 691416866452082688')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1511,691416866452082688,,,2016-01-25 00:26:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",I present to you... Dog Jesus. 13/10 (he could...,,,,https://twitter.com/dog_rates/status/691416866...,13,10,,,,,


In [32]:
df_archive.query('tweet_id == 803692223237865472')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
555,803692223237865472,,,2016-11-29 20:08:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: I present to you... Dog Jesus. ...,6.914169e+17,4196984000.0,2016-01-25 00:26:41 +0000,https://twitter.com/dog_rates/status/691416866...,13,10,,,,,


In [33]:
df_predictions['jpg_url'].duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
2070    False
2071    False
2072    False
2073    False
2074    False
Name: jpg_url, Length: 2075, dtype: bool

In [34]:
df_predictions.sample(15)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
243,670452855871037440,https://pbs.twimg.com/media/CU3tUC4WEAAoZby.jpg,1,Arctic_fox,0.188174,False,indri,0.123584,False,malamute,0.080379,True
584,678969228704284672,https://pbs.twimg.com/media/CWwu6OLUkAEo3gq.jpg,1,Labrador_retriever,0.680251,True,Chesapeake_Bay_retriever,0.201697,True,golden_retriever,0.019676,True
207,669942763794931712,https://pbs.twimg.com/media/CUwdYL5UsAAP0XX.jpg,1,vizsla,0.743216,True,redbone,0.217282,True,Rhodesian_ridgeback,0.028473,True
1244,747461612269887489,https://pbs.twimg.com/media/Cl-EXHSWkAE2IN2.jpg,1,binoculars,0.192717,False,barbershop,0.085838,False,ballplayer,0.084672,False
117,668142349051129856,https://pbs.twimg.com/media/CUW37BzWsAAlJlN.jpg,1,Angora,0.918834,False,hen,0.037793,False,wood_rabbit,0.011015,False
1923,857029823797047296,https://pbs.twimg.com/media/C-TIEwMW0AEjb55.jpg,2,golden_retriever,0.968623,True,Labrador_retriever,0.010325,True,Saluki,0.004148,True
1785,829374341691346946,https://pbs.twimg.com/media/C4KHj-nWQAA3poV.jpg,1,Staffordshire_bullterrier,0.757547,True,American_Staffordshire_terrier,0.14995,True,Chesapeake_Bay_retriever,0.047523,True
1918,855459453768019968,https://pbs.twimg.com/media/C98z1ZAXsAEIFFn.jpg,2,Blenheim_spaniel,0.389513,True,Pekinese,0.18822,True,Japanese_spaniel,0.082628,True
444,674644256330530816,https://pbs.twimg.com/media/CVzRXmXWIAA0Fkr.jpg,1,soccer_ball,0.398102,False,basset,0.335692,True,cocker_spaniel,0.072941,True
416,674019345211760640,https://pbs.twimg.com/media/CVqZBO8WUAAd931.jpg,1,collie,0.992732,True,borzoi,0.005043,True,Shetland_sheepdog,0.001725,True


In [35]:
df_predictions['p1'].sort_values().value_counts(sort=False)

p1
Afghan_hound            4
African_crocodile       1
African_grey            1
African_hunting_dog     1
Airedale               12
                       ..
wombat                  4
wood_rabbit             3
wooden_spoon            1
wool                    2
zebra                   1
Name: count, Length: 378, dtype: int64

In [36]:
df_predictions['p2'].sort_values().value_counts(sort=False)

p2
Afghan_hound                       5
African_hunting_dog                1
Airedale                           7
American_Staffordshire_terrier    21
American_alligator                 2
                                  ..
window_screen                      4
window_shade                       1
wire-haired_fox_terrier            4
wombat                             1
wood_rabbit                        1
Name: count, Length: 405, dtype: int64

In [37]:
df_predictions['p3'].sort_values().value_counts(sort=False)

p3
Afghan_hound                       4
African_chameleon                  1
African_grey                       1
Airedale                          11
American_Staffordshire_terrier    24
                                  ..
wombat                             1
wood_rabbit                        3
wool                               3
wreck                              2
zebra                              1
Name: count, Length: 408, dtype: int64

In [38]:
df_predictions.query('img_num > 2').sort_values('img_num', ascending=False)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
144,668623201287675904,https://pbs.twimg.com/media/CUdtP1xUYAIeBnE.jpg,4,Chihuahua,0.708163,True,Pomeranian,0.091372,True,titi,0.067325,False
1161,734787690684657664,https://pbs.twimg.com/media/CjJ9gQ1WgAAXQtJ.jpg,4,golden_retriever,0.883991,True,chow,0.023542,True,Labrador_retriever,0.016056,True
1286,750868782890057730,https://pbs.twimg.com/media/CmufLLsXYAAsU0r.jpg,4,toy_poodle,0.912648,True,miniature_poodle,0.035059,True,seat_belt,0.026376,False
1795,831315979191906304,https://pbs.twimg.com/media/C4lst0bXAAE6MP8.jpg,4,briard,0.982755,True,soft-coated_wheaten_terrier,0.009084,True,Bouvier_des_Flandres,0.004693,True
1790,830097400375152640,https://pbs.twimg.com/media/C4UZLZLWYAA0dcs.jpg,4,toy_poodle,0.442713,True,Pomeranian,0.142073,True,Pekinese,0.125745,True
...,...,...,...,...,...,...,...,...,...,...,...,...
1369,761976711479193600,https://pbs.twimg.com/media/CpMVxoRXgAAh350.jpg,3,Labrador_retriever,0.475552,True,Chesapeake_Bay_retriever,0.082898,True,Staffordshire_bullterrier,0.048464,True
1320,756288534030475264,https://pbs.twimg.com/media/Cn7gaHrWIAAZJMt.jpg,3,conch,0.925621,False,French_bulldog,0.032492,True,tiger_cat,0.006679,False
1303,753026973505581056,https://pbs.twimg.com/media/CnNKCKKWEAASCMI.jpg,3,Pembroke,0.868511,True,Cardigan,0.103708,True,Shetland_sheepdog,0.018142,True
1299,752519690950500352,https://pbs.twimg.com/media/CnF8qVDWYAAh0g1.jpg,3,swing,0.999984,False,Labrador_retriever,0.000010,True,Eskimo_dog,0.000001,True


<a id='json'></a>
### Assessing `df_json`

#### Visual assessment

In [39]:
df_json

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,8853,39467
1,892177421306343426,6514,33819
2,891815181378084864,4328,25461
3,891689557279858688,8964,42908
4,891327558926688256,9774,41048
...,...,...,...
2349,666049248165822465,41,111
2350,666044226329800704,147,311
2351,666033412701032449,47,128
2352,666029285002620928,48,132


#### Programmatic Assessment

In [40]:
df_json.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tweet_id        2354 non-null   object
 1   retweet_count   2354 non-null   int64 
 2   favorite_count  2354 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 55.3+ KB


In [41]:
df_json.duplicated().any()

False

In [42]:
df_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        1611 non-null   object 
 13  doggo                       97 no

<a id='quality'></a>
### Quality issues


#### `df_archive` table
1. The name value for 55 of the tweets is 'a'
2. doggo, floofer, pupper, and puppo are all NaN for most of the records
3. tweet_id is an int not a string
4. timestamp is an object not a datetime
5. in_reply_to_status_id and in_reply_to_user_id are floats, not strings
6. rating_denominator has values other than 10, including 0
7. There are two more tweets than in the `df_json` table
8. There are 281 more tweets than in the `df_predictions` table
9. Missing values for in_reply_to_status_id
10. Missing values for in_reply_to_user_id
11. Missing values for retweeted_status_id
12. Missing values for retweeted_status_user_id
13. Missing values for retweeted_status_timestamp
14. Missing values for expanded_urls
15. Missing values for name

#### `df_predictions` table
1. tweet_id is an int not a string
2. There are 281 fewer tweets than in the `df_archive` table

#### `df_json` table
1. There are two fewer tweets than in the `df_archive` table
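
To see which tweets account for a row-count difference between two tables, a left merge with `indicator=True` is one option. The frames below are made up; `archive` and `counts` stand in for `df_archive` and `df_json`.

```python
import pandas as pd

# Made-up stand-ins for df_archive and df_json
archive = pd.DataFrame({'tweet_id': ['1', '2', '3', '4']})
counts = pd.DataFrame({'tweet_id': ['1', '2', '4'], 'retweet_count': [5, 0, 7]})

# indicator=True adds a _merge column saying where each row was found
merged = archive.merge(counts, on='tweet_id', how='left', indicator=True)
missing_ids = merged.loc[merged['_merge'] == 'left_only', 'tweet_id'].tolist()
```

Both frames need `tweet_id` in the same dtype for the merge keys to match, which is another reason to store the IDs as strings.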

<a id='tidiness'></a>
### Tidiness issues

1. doggo, floofer, pupper, and puppo are values of one variable (dog stage) stored as column names and should be combined into one column

2. The `df_json` table should be merged into the `df_archive` table because it contains information about the tweet
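
A minimal sketch of the first tidiness fix, assuming a new column named `dog_stage` (a name chosen here for illustration) and made-up rows. Tweets that mention more than one stage keep both values, joined by a comma.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows; the third one mentions two stages
tweets = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo': ['doggo', None, 'doggo'],
    'floofer': [None, None, None],
    'pupper': [None, None, 'pupper'],
    'puppo': [None, None, None],
})


def join_stages(row):
    """Join the non-missing stage values of one row, or return None."""
    found = [value for value in row if pd.notna(value)]
    return ', '.join(found) if found else None


tweets['dog_stage'] = tweets[stage_cols].apply(join_stages, axis=1)
tidy = tweets.drop(columns=stage_cols)  # the four source columns are no longer needed
stages = tweets['dog_stage'].tolist()
```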

<a id='cleaning'></a>
## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
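
As a small illustration of why the copy matters, the sketch below cleans a copy of a made-up frame and leaves the original untouched. The names `original` and `clean` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({'tweet_id': [1, 2], 'name': ['a', 'Tilly']})
clean = original.copy()  # independent copy (deep by default)

# Cleaning steps are applied to the copy only
clean['tweet_id'] = clean['tweet_id'].astype(str)
clean.loc[clean['name'] == 'a', 'name'] = None
```

Plain assignment (`clean = original`) would only create a second name for the same object, so the cleaning steps would also alter the original data.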

In [43]:
# Make copies of original pieces of data


### Issue #1:

#### Define

#### Code

#### Test

### Issue #2:

#### Define

#### Code

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
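
A sketch of the save step on a one-row, made-up frame, written to a temporary folder so it can be run anywhere. Passing `index=False` avoids writing the index as an extra unnamed column, and reading `tweet_id` back with `dtype=str` keeps it as text.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the cleaned master DataFrame
master = pd.DataFrame({'tweet_id': ['892420643555336193'], 'rating_numerator': [13]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'twitter_archive_master.csv')
    master.to_csv(path, index=False)
    restored = pd.read_csv(path, dtype={'tweet_id': str})

columns = list(restored.columns)
```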

<a id='eda'></a>
## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization

<a id='conclusions'></a>
## Conclusions