# Project: Wrangle and Analyze Data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gathering">Data Gathering</a></li>
<li><a href="#assessing">Assessing Data</a></li>
<li><a href="#cleaning">Cleaning Data</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Descriptions

#### WeRateDogs enhanced Twitter archive data

This dataset is provided by Udacity. It is a subset of the 5000+ tweets in the WeRateDogs Twitter archive, filtered to keep only the tweets that contain ratings.

|        **Features**        |                                                             **Description**                                                            |
|:--------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------:|
|          tweet_id          |                                                        Unique id for each tweet                                                        |
|    in_reply_to_status_id   |                                               The tweet id for the tweet the reply is to                                               |
|     in_reply_to_user_id    |                                                    The user id that the reply is to                                                    |
|          timestamp         |                                                     Date and time tweet was posted                                                     |
|           source           |                                                      Source tweet was posted from                                                      |
|            text            |                                                        Text content of the tweet                                                       |
|     retweeted_status_id    |                                                   ID of the tweet that was retweeted                                                   |
|  retweeted_status_user_id  |                                                   User ID of user from original tweet                                                  |
| retweeted_status_timestamp |                                                    Date and time retweet was posted                                                    |
|        expanded_urls       |                 URLs for links to media in the tweet (this can be video, photo, URLs to other tweets, or other URLs)                   |
|      rating_numerator      | Rating for dog (Can be higher than denominator per this [unique rating system](https://knowyourmeme.com/memes/theyre-good-dogs-brent)) |
|     rating_denominator     |                                Top of scale for rating of dog (although can be lower than the numerator)                               |
|            name            |                                                             Name of the dog                                                            |
|            doggo           |                              A big pupper, usually older. A pupper that appears to have its life in order.                             |
|           floofer          |                        Any dog really. However, this label is commonly given to dogs with seemingly excess fur.                        |
|           pupper           |                                                     A small doggo, usually younger.                                                    |
|            puppo           |                  A transitional phase between pupper and doggo. Easily understood as the dog equivalent of a teenager.                 |

#### Tweet image prediction data

This dataset is the result of running every image from the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs. This dataset gives the top three predictions only.

| **Features** |                                                         **Description**                                                        |
|:------------:|:------------------------------------------------------------------------------------------------------------------------------:|
|   tweet_id   |                                                    Unique id for each tweet                                                    |
|    jpg_url   |                                                        URL for the image                                                       |
|    img_num   | The image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images). |
|      p1      |                                    The algorithm's #1 prediction for the image in the tweet                                    |
|    p1_conf   |                                       How confident the algorithm is in its #1 prediction                                      |
|    p1_dog    |                                       Whether or not the #1 prediction is a breed of dog                                       |
|      p2      |                                          The algorithm's second most likely prediction                                         |
|    p2_conf   |                                       How confident the algorithm is in its #2 prediction                                      |
|    p2_dog    |                                       Whether or not the #2 prediction is a breed of dog                                       |
|      p3      |                                          The algorithm's third most likely prediction                                          |
|    p3_conf   |                                       How confident the algorithm is in its #3 prediction                                      |
|    p3_dog    |                                       Whether or not the #3 prediction is a breed of dog                                       |

#### Twitter API results

This is the data produced by the Twitter API code provided by Udacity. It gives the retweet count and favorite count for each tweet in the WeRateDogs enhanced Twitter archive dataset.

|  **Features**  |                **Description**                |
|:--------------:|:---------------------------------------------:|
|    tweet_id    |            Unique id for each tweet           |
|  retweet_count |            Total number of retweets           |
| favorite_count | Total number of times the tweet was favorited |

### Import necessary libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json
import tweepy
from tweepy import OAuthHandler
from timeit import default_timer as timer
import os
import requests
import lxml
from dotenv import load_dotenv

%matplotlib inline

Set options for showing dataframes

In [2]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 100)

<a id='gathering'></a>
## Data Gathering

<ul>
<li><a href="#twitter_archive">WeRateDogs Twitter archive data</a></li>
<li><a href="#img_predictions">Tweet image prediction data</a></li>
<li><a href="#tweet_json">Twitter API data</a></li>
</ul>

<a id='twitter_archive'></a>
### Read in the WeRateDogs Twitter archive data downloaded directly from Udacity

In [3]:
twitter_archive = pd.read_csv('./data/twitter-archive-enhanced.csv')

<a id='img_predictions'></a>
### Use the Requests library to download the tweet image prediction data

#### Functions for creating folders and files and for downloading data with the Requests library

In [4]:
def get_data_content(url):
    """
    Return the content of the response of provided url

    This script requires that `requests` be installed within the Python
    environment you are running this script in.

    Parameter
    -----------
    url : str

    Returns
    ----------
    byte
        Response from url as bytes
    """
    
    response = requests.get(url)
    return response.content

def create_folder(folder_name):
    """
    Creates a folder in immediate path if the folder name does not already exist

    This function only uses `os`, which is part of the Python standard library.

    Parameter
    -----------
    folder_name : str
    """
    
    # Checks directory for folder, folder_name
    if not os.path.exists(folder_name):
        os.makedirs(folder_name) # If folder_name was not found, makes new folder, folder_name

def create_file(url, folder_name):
    """
    Creates a file with the contents from `get_data_content` in folder if the file 
    name does not already exist

    This function only uses `os`, which is part of the Python standard library.

    Parameters
    -----------
    url : str
    folder_name : str
    """
    
    file_name = url.split('/')[-1] # Keeps the text to the right of the last / in the url
    file_path = os.path.join(folder_name, file_name)
    # Check for the file inside folder_name before opening it:
    # opening in 'wb' mode creates the file (or empties an existing one)
    if not os.path.exists(file_path):
        # response.content is stored as bytes so mode argument is set to 'wb' for write binary
        with open(file_path, mode='wb') as file:
            file.write(get_data_content(url))

def create_data_file(url, folder_name):
    """
    Create file with contents from the response of a URL inside a folder

    Parameters
    -----------
    url : str
    folder_name : str    
    """
    
    create_folder(folder_name)
    create_file(url, folder_name)

In [5]:
create_data_file('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv',
                 'data')

In [6]:
img_predictions = pd.read_csv('./data/image-predictions.tsv', delimiter='\t')

<a id='tweet_json'></a>
### Use the Tweepy library to query additional data via the Twitter API

Below is the code Udacity provides to students who are unable to use the API (which is now all students, since Twitter changed its API access).

**Please note:**
> Twitter no longer allows v1.1 API access. The free plan for v2 does not include tweet lookup. The basic plan for v2 can retrieve up to 10,000 tweets per month at a rate limit of 15 requests per 15 minutes, but it costs $100/month.
> 
> I had already created a dotenv file to store my API credentials and keep them hidden.
> 
> The cell below is in Raw format so that it is not run, since the code assumes v1.1 of Twitter's API and I was only able to get v2 credentials.

Since I am unable to use the Twitter API to gather the data I need, I will use the output of the code above as provided by Udacity.

> I will download it the same way I downloaded the image-predictions.tsv data.

In [7]:
create_data_file('https://video.udacity-data.com/topher/2018/November/5be5fb7d_tweet-json/tweet-json.txt',
                 'data')

In [8]:
df_list = [] # Creates empty list to build DataFrame later
with open('./data/tweet-json.txt', encoding='utf-8') as file:
    for line in file:
        tweet = json.loads(line) # Store line as JSON format to make the process of accessing data easier than using slice method
        tweet_id = tweet['id_str'] # Get tweet id as string
        retweet_count = tweet['retweet_count']
        favorite_count = tweet['favorite_count']
        # Add dict to list to use for creating a DataFrame
        df_list.append({'tweet_id': tweet_id,
                        'retweet_count': retweet_count,
                        'favorite_count': favorite_count})

# Create DataFrame using list created from file
tweet_json = pd.DataFrame(df_list, columns = ['tweet_id', 'retweet_count', 'favorite_count'])
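The loop above assumes every line of the file holds one complete JSON object. The same extraction can be sketched as a small reusable function. The name `parse_tweet_lines`, the decision to skip blank lines, and the sample tweets are illustrative assumptions, not part of the Udacity code:

```python
import json

import pandas as pd


def parse_tweet_lines(lines):
    """Build a DataFrame of tweet_id, retweet_count, favorite_count from JSON lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # Skip blank lines (assumption: they carry no data)
            continue
        tweet = json.loads(line)
        records.append({'tweet_id': tweet['id_str'],  # Keep the id as a string
                        'retweet_count': tweet['retweet_count'],
                        'favorite_count': tweet['favorite_count']})
    return pd.DataFrame(records, columns=['tweet_id', 'retweet_count', 'favorite_count'])


# Two made-up tweets in the same line-delimited layout as tweet-json.txt
sample_lines = [
    '{"id_str": "1", "retweet_count": 5, "favorite_count": 10}',
    '',
    '{"id_str": "2", "retweet_count": 0, "favorite_count": 3}',
]
sample_tweets = parse_tweet_lines(sample_lines)
```

Because the function accepts any iterable of strings, an open file handle can be passed to it directly.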

<a id='assessing'></a>
## Assessing Data

<ul>
<li><a href="#archive">Assessing twitter_archive</a></li>
<li><a href="#predictions">Assessing img_predictions</a></li>
<li><a href="#json">Assessing tweet_json</a></li>
<li><a href="#quality">Quality Issues</a></li>
<li><a href="#tidiness">Tidiness Issues</a></li>
</ul>
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.

In [9]:
# Reusing my own function from the _Investigate a dataset_ project from earlier in this nanodegree
# I updated the docstring and confirmed the function is used more than once
def check_for_dupes(data, col):
    """
    Checks for duplicate values in one column of a dataset (dupe or dupes)
    Returns a string with the column name, the number of duplicated values, and the frequency of each value in the column (if applicable)

    Parameters
    -----------
    data : DataFrame
    col : str
        Name of the column in `data` to check
    """
    
    dupe_count = data[col].duplicated().sum()
    if dupe_count == 0:
        return f'{col} has {dupe_count} dupes\n'
    elif dupe_count == 1:
        return f'{col} has {dupe_count} dupe:\n{data[col].value_counts()}\n'
    else:
        return f'{col} has {dupe_count} dupes:\n{data[col].value_counts()}\n'

<a id='archive'></a>
### Assessing `twitter_archive`

#### Visual assessment

In [10]:
twitter_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq,,,,https://twitter.com/dog_rates/status/666049248165822465/photo/1,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx,,,,https://twitter.com/dog_rates/status/666044226329800704/photo/1,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR,,,,https://twitter.com/dog_rates/status/666033412701032449/photo/1,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI,,,,https://twitter.com/dog_rates/status/666029285002620928/photo/1,7,10,a,,,,


##### Visual Assessment notes:

* Some names are missing
* Some names are the word 'a' rather than a real name
* Rows are missing a classification in doggo, floofer, pupper, or puppo
* Some expanded_urls values contain multiple URLs separated by commas; some of the URLs are for photos while others are for videos or other links

#### Programmatic Assessment

In [11]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        1611 non-null   object 
 13  doggo                       97 no

* 78 rows are replies (will need to ensure these are not updates to ratings)
* 181 rows are retweets (which for tweets can be considered duplicated data since it is a repost of original tweet) and should not be included in the analysis of this dataset
* 59 rows are missing values in the expanded_urls column
* name is missing for 745 rows
* doggo, floofer, pupper, and puppo are empty for a majority of the records
* in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, and retweeted_status_user_id are float and should be string
* timestamp and retweeted_status_timestamp are object and should be datetime
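A float64 cannot represent every 18-digit integer exactly, so ID columns that were already read as floats may no longer match the original digits. One way to avoid this is to ask `read_csv` for strings up front. The following is a minimal sketch under that assumption; the two-row CSV and the `id_cols` list are made up for illustration:

```python
import io

import pandas as pd

# Made-up two-row CSV that reuses the archive's column names
sample_csv = io.StringIO(
    "tweet_id,in_reply_to_status_id,timestamp\n"
    "892420643555336193,,2017-08-01 16:23:56 +0000\n"
    "886267009285017600,886266357075128321,2017-07-15 16:51:35 +0000\n"
)

# Read the ID columns as strings so no digits are lost to float rounding
id_cols = ['tweet_id', 'in_reply_to_status_id']
sample = pd.read_csv(sample_csv, dtype={col: str for col in id_cols})

# Parse the timestamp strings into timezone-aware (UTC) datetimes
sample['timestamp'] = pd.to_datetime(sample['timestamp'], utc=True)
```

Missing IDs still come back as NaN, so checks such as `isna()` keep working after the type change.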

In [12]:
# Count of duplicated rows
twitter_archive.duplicated().sum()

0

In [13]:
# Print out the duplicate values for each column in the data set
for col in twitter_archive.columns:
    print(check_for_dupes(twitter_archive, col), '\n', '--------------------------------------------', '\n')

tweet_id has 0 dupes
 
 -------------------------------------------- 

in_reply_to_status_id has 2278 dupes:
in_reply_to_status_id
6.671522e+17    2
8.862664e+17    1
6.920419e+17    1
6.827884e+17    1
6.842229e+17    1
6.844811e+17    1
6.849598e+17    1
6.855479e+17    1
6.860340e+17    1
6.903413e+17    1
6.924173e+17    1
6.780211e+17    1
6.935722e+17    1
6.936422e+17    1
6.706684e+17    1
6.753494e+17    1
6.964887e+17    1
7.030419e+17    1
7.044857e+17    1
6.813394e+17    1
6.765883e+17    1
7.079801e+17    1
6.757073e+17    1
6.689207e+17    1
6.678065e+17    1
6.693544e+17    1
6.715449e+17    1
6.715610e+17    1
6.737159e+17    1
6.658147e+17    1
6.744689e+17    1
6.747400e+17    1
6.747522e+17    1
6.717299e+17    1
6.747934e+17    1
6.749998e+17    1
6.754971e+17    1
6.758457e+17    1
7.032559e+17    1
7.291135e+17    1
8.816070e+17    1
8.320875e+17    1
8.380855e+17    1
8.381455e+17    1
8.406983e+17    1
7.590995e+17    1
8.476062e+17    1
8.482121e+17    1
8.503

tweet_id values are all unique

expanded_urls has 137 dupes, which might mean they are duplicated by retweets and/or replies, although this number is lower than the number of retweets (181). This will require further investigation and will also require cleaning/transforming the data.
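`Series.duplicated()` treats repeated missing values as duplicates of each other, which is why in_reply_to_status_id reports 2278 dupes even though it has only 78 non-null values. Part of the expanded_urls count may therefore come from its 59 missing values rather than from repeated URLs. A minimal sketch of a count that ignores missing values follows; the helper name `count_non_null_dupes` and the sample data are hypothetical:

```python
import numpy as np
import pandas as pd


def count_non_null_dupes(data, col):
    """Count repeated values in data[col], ignoring missing values."""
    return int(data[col].dropna().duplicated().sum())


# Made-up column: one repeated URL and three missing values
sample = pd.DataFrame({'expanded_urls': ['u1', 'u1', 'u2', np.nan, np.nan, np.nan]})

# Counts the repeated 'u1' plus two of the three NaN values
with_nan = int(sample['expanded_urls'].duplicated().sum())

# Counts only the repeated 'u1'
without_nan = count_non_null_dupes(sample, 'expanded_urls')
```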

In [40]:
# Find all the unique values in each column
for col in twitter_archive.columns:
    print(f'{col.upper()} has {twitter_archive[col].nunique()} unique values:\n  {twitter_archive[col].sort_values().unique()} \n\n ---------------------------------------------------------\n')

TWEET_ID has 2356 unique values:
  [666020888022790149 666029285002620928 666033412701032449 ...
 891815181378084864 892177421306343426 892420643555336193] 

 ---------------------------------------------------------

IN_REPLY_TO_STATUS_ID has 77 unique values:
  [6.65814697e+17 6.67065536e+17 6.67152164e+17 6.67806455e+17
 6.68920717e+17 6.69354383e+17 6.70668383e+17 6.71544874e+17
 6.71561002e+17 6.71729907e+17 6.73715862e+17 6.74468881e+17
 6.74739953e+17 6.74752233e+17 6.74793399e+17 6.74999808e+17
 6.75349384e+17 6.75497103e+17 6.75707330e+17 6.75845657e+17
 6.76588346e+17 6.78021116e+17 6.81339449e+17 6.82788442e+17
 6.84222868e+17 6.84481075e+17 6.84959799e+17 6.85547936e+17
 6.86034025e+17 6.90341254e+17 6.91416866e+17 6.92041935e+17
 6.92417313e+17 6.93572216e+17 6.93642232e+17 6.96488711e+17
 7.03041950e+17 7.03255936e+17 7.04485745e+17 7.07980066e+17
 7.29113531e+17 7.33109485e+17 7.38411920e+17 7.46885919e+17
 7.47648654e+17 7.50180499e+17 7.59099524e+17 7.63865175e+17
 7.6

rating_numerator has some outlier values that will need to be researched.

rating_denominator should normally be 10 but has some other values as well. This will require further research.
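The unusual ratings can be pulled out for a closer look with a couple of filters. This is a sketch on made-up numbers, and the numerator threshold of 20 is an arbitrary assumption chosen only for illustration:

```python
import pandas as pd

# Made-up ratings for illustration only
sample = pd.DataFrame({'rating_numerator': [13, 12, 84, 9, 1776],
                       'rating_denominator': [10, 10, 70, 11, 10]})

# Rows whose denominator is not the usual 10
odd_denominators = sample[sample['rating_denominator'] != 10]

# Frequency of each denominator, most common first
denominator_counts = sample['rating_denominator'].value_counts()

# Numerators far above the typical range (threshold is an assumption)
high_numerators = sample[sample['rating_numerator'] > 20]
```

Reading the tweet text of the rows returned by these filters is one way to tell genuine group ratings from parsing mistakes.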

The following names appear to be incorrect and will need to be cleaned:

<ul>
    <li>'a'</li>
    <li>'actually'</li>
    <li>'all'</li>
    <li>'an'</li>
    <li>'by'</li>
    <li>'getting'</li>
    <li>'his'</li>
    <li>'incredibly'</li>
    <li>'infuriating'</li>
    <li>'just'</li>
    <li>'life'</li>
    <li>'light'</li>
    <li>'mad'</li>
    <li>'my'</li>
    <li>'not'</li>
    <li>'officially'</li>
    <li>'old'</li>
    <li>'one'</li>
    <li>'quite'</li>
    <li>'space'</li>
    <li>'such'</li>
    <li>'the'</li>
    <li>'this'</li>
    <li>'unacceptable'</li>
    <li>'very'</li>
</ul>
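Every entry in the list above is entirely lowercase, so one possible way to flag such values programmatically is to look for lowercase names. The sketch below assumes that real names in the archive start with a capital letter, which has not been verified for every row; the sample data is made up:

```python
import numpy as np
import pandas as pd

# Made-up names for illustration only
sample = pd.DataFrame({'name': ['Phineas', 'a', 'Tilly', 'the', np.nan, 'a', 'quite']})

# Drop missing names first so the string test only sees real strings
names = sample['name'].dropna()

# All-lowercase entries are likely ordinary words picked up by mistake
suspect_names = sorted(names[names.str.islower()].unique())
```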

<a id='predictions'></a>
### Assessing `img_predictions`

#### Visual assessment

In [22]:
img_predictions

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


There are fewer rows than in `twitter_archive`, which may mean that not all the tweets have images.

#### Programmatic Assessment

In [24]:
img_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


No missing values for any of the columns

In [42]:
# Check for duplicated rows
img_predictions.duplicated().sum()

0

In [26]:
# Print out the duplicate values for each column in the data set
for col in img_predictions.columns:
    print(check_for_dupes(img_predictions, col), '\n', '--------------------------------------------', '\n')

tweet_id has 0 dupes
 
 -------------------------------------------- 

jpg_url has 66 dupes:
jpg_url
https://pbs.twimg.com/media/CZhn-QAWwAASQan.jpg                                            2
https://pbs.twimg.com/media/Cq9guJ5WgAADfpF.jpg                                            2
https://pbs.twimg.com/ext_tw_video_thumb/807106774843039744/pu/img/8XZg1xW35Xp2J6JW.jpg    2
https://pbs.twimg.com/media/CU1zsMSUAAAS0qW.jpg                                            2
https://pbs.twimg.com/media/CsrjryzWgAAZY00.jpg                                            2
                                                                                          ..
https://pbs.twimg.com/media/CXrmMSpUwAAdeRj.jpg                                            1
https://pbs.twimg.com/media/CXrawAhWkAAWSxC.jpg                                            1
https://pbs.twimg.com/media/CXrIntsUsAEkv0d.jpg                                            1
https://pbs.twimg.com/media/CXqcOHCUQAAugTB.jpg               

In [43]:
# Find all the unique values in each column
for col in img_predictions.columns:
    print(f'{col.upper()} has {img_predictions[col].nunique()} unique values:\n  {img_predictions[col].sort_values().unique()} \n\n ---------------------------------------------------------\n')

TWEET_ID has 2075 unique values:
  [666020888022790149 666029285002620928 666033412701032449 ...
 891815181378084864 892177421306343426 892420643555336193] 

 ---------------------------------------------------------

JPG_URL has 2009 unique values:
  ['https://pbs.twimg.com/ext_tw_video_thumb/674805331965399040/pu/img/-7bw8niVrgIkLKOW.jpg'
 'https://pbs.twimg.com/ext_tw_video_thumb/675354114423808004/pu/img/qL1R_nGLqa6lmkOx.jpg'
 'https://pbs.twimg.com/ext_tw_video_thumb/675740268751138818/pu/img/dVaVeFAVT-lk_1ZV.jpg'
 ... 'https://pbs.twimg.com/tweet_video_thumb/CeBym7oXEAEWbEg.jpg'
 'https://pbs.twimg.com/tweet_video_thumb/CeGGkWuUUAAYWU1.jpg'
 'https://pbs.twimg.com/tweet_video_thumb/CtTFZZfUsAE5hgp.jpg'] 

 ---------------------------------------------------------

IMG_NUM has 4 unique values:
  [1 2 3 4] 

 ---------------------------------------------------------

P1 has 378 unique values:
  ['Afghan_hound' 'African_crocodile' 'African_grey' 'African_hunting_dog'
 'Airedale' 'Am

tweet_id values are all unique

jpg_url has duplicated values which means the same image was analyzed more than once. This will need to be investigated further.

In [27]:
# Find one of the duplicated jpg_url values
img_predictions.query('jpg_url == "https://pbs.twimg.com/media/CZhn-QAWwAASQan.jpg"')

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
800,691416866452082688,https://pbs.twimg.com/media/CZhn-QAWwAASQan.jpg,1,Lakeland_terrier,0.530104,True,Irish_terrier,0.197314,True,Airedale,0.082515,True
1624,803692223237865472,https://pbs.twimg.com/media/CZhn-QAWwAASQan.jpg,1,Lakeland_terrier,0.530104,True,Irish_terrier,0.197314,True,Airedale,0.082515,True


In [44]:
# Use first tweet_id to find tweet details for duplicated jpg_url
twitter_archive.query('tweet_id == 691416866452082688')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1511,691416866452082688,,,2016-01-25 00:26:41 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",I present to you... Dog Jesus. 13/10 (he could be sitting on a rock but I doubt it) https://t.co/fR1P3g5I6k,,,,https://twitter.com/dog_rates/status/691416866452082688/photo/1,13,10,,,,,


This is an original tweet

In [45]:
# Use second tweet_id to find tweet details for duplicated jpg_url
twitter_archive.query('tweet_id == 803692223237865472')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
555,803692223237865472,,,2016-11-29 20:08:52 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: I present to you... Dog Jesus. 13/10 (he could be sitting on a rock but I doubt it) https://t.co/fR1P3g5I6k,6.914169e+17,4196984000.0,2016-01-25 00:26:41 +0000,"https://twitter.com/dog_rates/status/691416866452082688/photo/1,https://twitter.com/dog_rates/status/691416866452082688/photo/1",13,10,,,,,


This is a retweet and is causing the duplication.
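If retweets are the cause, dropping them should also reduce the duplicated jpg_url values. A minimal sketch of that check on made-up miniature tables follows; whether it removes all 66 duplicates in the real data has not been verified here:

```python
import numpy as np
import pandas as pd

# Made-up miniature versions of the two tables
archive = pd.DataFrame({'tweet_id': [1, 2, 3],
                        'retweeted_status_id': [np.nan, np.nan, 1.0]})
predictions = pd.DataFrame({'tweet_id': [1, 2, 3],
                            'jpg_url': ['a.jpg', 'b.jpg', 'a.jpg']})

# Original tweets have no retweeted_status_id
original_ids = archive.loc[archive['retweeted_status_id'].isna(), 'tweet_id']

# Keep predictions for original tweets only, then re-count duplicated images
original_predictions = predictions[predictions['tweet_id'].isin(original_ids)]
remaining_dupes = int(original_predictions['jpg_url'].duplicated().sum())
```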

<a id='json'></a>
### Assessing `tweet_json`

#### Visual assessment

In [36]:
tweet_json

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,8853,39467
1,892177421306343426,6514,33819
2,891815181378084864,4328,25461
3,891689557279858688,8964,42908
4,891327558926688256,9774,41048
...,...,...,...
2349,666049248165822465,41,111
2350,666044226329800704,147,311
2351,666033412701032449,47,128
2352,666029285002620928,48,132


This dataset has fewer rows than the twitter_archive dataset and will need to be investigated further to understand why.
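One way to find which tweets are absent is to compare the ID columns of the two tables. In this notebook tweet_id is an int in `twitter_archive` and a string in `tweet_json`, so the sketch below casts before comparing. The miniature tables and the name `api_counts` are made up for illustration:

```python
import pandas as pd

# Made-up miniature tables; tweet_id is int in one and str in the other
archive = pd.DataFrame({'tweet_id': [101, 102, 103, 104]})
api_counts = pd.DataFrame({'tweet_id': ['101', '103'],
                           'retweet_count': [5, 7]})

# Compare IDs as strings so both tables use the same type
archive_ids = archive['tweet_id'].astype(str)

# Rows of the archive whose ID never appears in the API table
missing_from_api = archive[~archive_ids.isin(api_counts['tweet_id'])]
```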

#### Programmatic Assessment

In [37]:
tweet_json.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tweet_id        2354 non-null   object
 1   retweet_count   2354 non-null   int64 
 2   favorite_count  2354 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 55.3+ KB


No missing data

In [46]:
tweet_json.duplicated().sum()

0

<a id='quality'></a>
### Quality issues


#### `twitter_archive` table
1. The name value for 55 of the tweets is 'a'
2. doggo, floofer, pupper, and puppo are all NaN for most of the records
3. tweet_id is an int not a string
4. timestamp is an object not a datetime
5. in_reply_to_status_id and in_reply_to_user_id are float, not string
6. rating_denominator has values greater than 10
7. There are two more tweets than `tweet_json` table
8. There are 281 more tweets than in the `img_predictions` table
9. Missing values for in_reply_to_status_id
10. Missing values for in_reply_to_user_id
11. Missing values for retweeted_status_id
12. Missing values for retweeted_status_user_id
13. Missing values for retweeted_status_timestamp
14. Missing values for expanded_urls
15. Missing values for name

#### `img_predictions` table
1. tweet_id is an int not a string
2. There are 281 fewer tweets than in the `twitter_archive` table
3. Duplicated jpg_url values

#### `tweet_json` table
1. There are two fewer tweets than in the `twitter_archive` table

<a id='tidiness'></a>
### Tidiness issues

1. doggo, floofer, pupper, and puppo are values of a single variable (the dog stage) used as column names, and should be combined into one column

2. The `tweet_json` table should be merged into the `twitter_archive` table because it contains information about the tweet

3. expanded_urls has some entries with multiple URLs separated by commas; these should be separated into individual columns
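The first tidiness issue could be addressed along the following lines. This is a sketch, not the cleaning code of this project: the column name `dog_stage`, the comma used to join multiple stages, and the sample rows are all assumptions, and it presumes missing stages are stored as NaN as in the `info()` output above:

```python
import numpy as np
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows: each stage column holds its own name or is missing
sample = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                       'doggo': ['doggo', np.nan, 'doggo'],
                       'floofer': [np.nan, np.nan, np.nan],
                       'pupper': [np.nan, np.nan, 'pupper'],
                       'puppo': [np.nan, np.nan, np.nan]})


def combine_stages(row):
    """Join the non-missing stage labels in a row; return NaN if there are none."""
    stages = [value for value in row if pd.notna(value)]
    return ','.join(stages) if stages else np.nan


# Replace the four stage columns with a single dog_stage column
tidy = sample.drop(columns=stage_cols)
tidy['dog_stage'] = sample[stage_cols].apply(combine_stages, axis=1)
```

Rows with more than one stage keep every label, so they can be reviewed separately instead of being silently reduced to one value.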

<a id='cleaning'></a>
## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [39]:
# Make copies of original pieces of data
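A minimal sketch of the copy step, shown on a stand-in table; the names `twitter_archive_demo` and `archive_clean` are hypothetical:

```python
import pandas as pd

# Stand-in for one of the gathered tables
twitter_archive_demo = pd.DataFrame({'tweet_id': [1, 2], 'name': ['a', 'Tilly']})

# .copy() returns an independent DataFrame, so cleaning it leaves the original intact
archive_clean = twitter_archive_demo.copy()
archive_clean.loc[archive_clean['name'] == 'a', 'name'] = None
```

Plain assignment (`archive_clean = twitter_archive_demo`) would only create a second name for the same object, so edits would change the original as well.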


### Issue #1:

#### Define

#### Code

#### Test

### Issue #2:

#### Define

#### Code

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

<a id='eda'></a>
## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization

<a id='conclusions'></a>
## Conclusions