# Data Wrangling on WeRateDogs



## Table of Contents
- [Introduction](#intro)
- [Gather](#gather)
- [Assess](#assess)
- [Clean](#clean)

<a id='intro'></a>
## Introduction

The main purpose of this project is to put into practice the key concepts learnt during this module. The tasks carried out in this notebook are gathering, assessing and cleaning data that has been provided to us through different means. Python 3 is used throughout, and the solution relies on the following set of Python libraries.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import tweepy
import json
from timeit import default_timer as timer
%matplotlib inline

<a id='gather'></a>
## Gather    

The first step in data analysis is data gathering. In this case, data gathering comprises the following steps:   
**1.-** Import the Twitter archive for WeRateDogs into the workspace   
**2.-** Download image predictions data from a given URL   
**3.-** Gather data from the Twitter API to collect more relevant information   

#### 1.- Import the Twitter archive
This file has already been provided as a manual download. As such, it is available locally and only needs to be imported with pandas' `read_csv` function, which reads flat files into a pandas dataframe. We then output the first row to verify that it has loaded successfully.

In [3]:
archive_df = pd.read_csv("twitter_archive_enhanced.csv")
archive_df.head(1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,


#### 2.- Download image predictions data from a given URL   
Unlike the previous file, only the URL of this file was given. Hence, we download the file programmatically, as this is less error-prone and eases reproducibility.    
First, download the file using the Python library **_requests_** and save the downloaded content to a file. Then import the file as in step \#1, this time using a tab separator. Finally, output the first row to verify that everything has gone correctly.

In [4]:
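url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
image_predictions = requests.get(url)
# Stop here on an HTTP error rather than saving an error page to disk
image_predictions.raise_for_status()
# Name the local file after the last component of the URL
file_name = url.split('/')[-1]
with open(file_name, mode='w', encoding=image_predictions.encoding) as f:
    f.write(image_predictions.text)
# The downloaded file is tab-separated
image_df = pd.read_csv(file_name, sep='\t')
image_df.head(1)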

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
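
As a hedged variation on the cell above, the download could be wrapped in a small helper that fails loudly on HTTP errors and writes the raw bytes, so the result does not depend on the text encoding that was guessed for the response. The names `filename_from_url` and `download_file`, the `dest_dir` parameter and the 30-second timeout are assumptions made for this sketch and are not part of the original notebook.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path component of a URL (hypothetical helper)."""
    return PurePosixPath(urlparse(url).path).name


def download_file(url, dest_dir="."):
    """Download `url` into `dest_dir` and return the path written (hypothetical helper)."""
    # Imported lazily so filename_from_url works even without requests installed.
    import requests

    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    target = Path(dest_dir) / filename_from_url(url)
    target.write_bytes(response.content)  # raw bytes: no text decoding involved
    return target
```

With this helper, `pd.read_csv(download_file(url), sep='\t')` would load the predictions in the same way as the cell above.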


#### 3.- Gather data from Twitter by accessing their API 
The third step of this data gathering effort is to obtain additional data directly from Twitter through their API. Please note that keys and tokens are kept in a separate file that is stored locally and is not part of source control. As such, that file would need to be edited if this notebook were to use different Twitter app credentials.    
The idea is to query Twitter's API through Tweepy, a Python library for the Twitter API, to get more information for each of the tweets in the archive_df dataframe, and to store the JSON content in a text file. Once the JSON text file is available, it is parsed into a new dataframe to work with in the Python environment.    
So let's first define the function that will be used to parse the JSON text file into a dataframe.

In [5]:
def parseJson():
    """Parse tweet_json.txt (one JSON object per line) into a dataframe."""
    jsonlist = []
    with open('tweet_json.txt','r',encoding='utf-8') as fjsonRead:
        for line in fjsonRead:
            # Each line holds the full JSON of one tweet; keep only the fields of interest
            json_dict = json.loads(line)
            tweet_id = json_dict['id']
            retw_count = json_dict['retweet_count']
            fav_count = json_dict['favorite_count']
            jsonlist.append({'tweet_id':tweet_id, 'retw_count':retw_count, 'fav_count':fav_count})
    return pd.DataFrame(jsonlist,columns=['tweet_id', 'retw_count', 'fav_count'])
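
The parser above expects `tweet_json.txt` to hold one JSON object per line. The following stdlib-only sketch shows that file layout and the same field extraction on two made-up records. The function name `parse_json_lines`, the temporary file and the sample values are all illustrative assumptions.

```python
import json
import os
import tempfile


def parse_json_lines(path):
    """Read one JSON object per line and keep the three fields used above."""
    records = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            tweet = json.loads(line)
            records.append({
                "tweet_id": tweet["id"],
                "retw_count": tweet["retweet_count"],
                "fav_count": tweet["favorite_count"],
            })
    return records


# Made-up sample data, written in the same one-object-per-line layout.
sample = [
    {"id": 1, "retweet_count": 10, "favorite_count": 20, "full_text": "a"},
    {"id": 2, "retweet_count": 30, "favorite_count": 40, "full_text": "b"},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    for tweet in sample:
        tmp.write(json.dumps(tweet) + "\n")

records = parse_json_lines(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(records)
```

The resulting list of dictionaries can be passed to `pd.DataFrame`, as `parseJson` does.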

Downloading information from the API for so many tweets is a time-consuming task: this operation can take 30 minutes on its own, so we do not want to repeat it every time this notebook is run. Therefore, if the file is present, we assume that the API has already been queried and the data stored in the file, and proceed directly to parsing the file into a dataframe. Otherwise, we query the API and store the data in a text file before parsing.

In [7]:
try:
    tweeter_df = parseJson()
except FileNotFoundError:
    # No cached file yet: query the API and store one JSON object per line
    start = timer()
    # Read the credentials first; the value is the first whitespace-separated token of each line
    with open('tweeterKeys.txt','r',encoding='utf-8') as fkeys:
        getData = lambda line: line.split()[0]
        consumer_key = getData(fkeys.readline())
        consumer_secret = getData(fkeys.readline())
        access_token = getData(fkeys.readline())
        access_secret = getData(fkeys.readline())
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    # Note: wait_on_rate_limit_notify and tweepy.TweepError belong to the Tweepy 3.x API
    api = tweepy.API(auth,wait_on_rate_limit=True,wait_on_rate_limit_notify=True)
    # Separate handles for the output file and the failed-id log (the keys file no longer shadows `f`)
    with open('tweet_json.txt','w',encoding='utf-8') as f, \
         open('tweetIdsFailed.txt','w',encoding='utf-8') as ffail:
        for row in archive_df['tweet_id']:
            try:
                tweet = api.get_status(row,tweet_mode='extended')
                json.dump(tweet._json, f)
                f.write('\n')
            except tweepy.TweepError:
                # Record the ids that could not be retrieved
                ffail.write(str(row)+'\n')
    end = timer()
    print('Time: {}'.format(end-start))
    tweeter_df = parseJson()
tweeter_df.head(1)

Unnamed: 0,tweet_id,retw_count,fav_count
0,892420643555336193,8366,38195
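
The cache-then-fetch pattern used in the cell above can be isolated from the Twitter specifics. Below is a minimal sketch, assuming a hypothetical `load_or_build` helper that takes the cache path, a `build` callable that writes the file and a `load` callable that reads it. None of these names come from the original notebook, and the stand-in `build` simply writes a fixed string instead of querying an API.

```python
import os
import tempfile


def load_or_build(path, build, load):
    """Run the expensive `build` step only if `path` does not exist yet."""
    if not os.path.exists(path):
        # Write to a temporary name first so an interrupted build
        # does not leave a partial file that looks like a valid cache.
        partial = path + ".partial"
        build(partial)
        os.replace(partial, path)
    return load(path)


calls = []


def build(path):
    """Stand-in for the slow API download."""
    calls.append(path)  # track how often the expensive step runs
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("cached\n")


def load(path):
    """Stand-in for parsing the cached file."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as workdir:
    cache = os.path.join(workdir, "tweet_json.txt")
    first = load_or_build(cache, build, load)   # builds, then loads
    second = load_or_build(cache, build, load)  # loads from the cache only
```

Writing to a temporary name and renaming it addresses one weakness of checking for file existence alone: if the download is interrupted after `tweet_json.txt` has been opened for writing, later runs would treat the partial file as complete.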


<a id='assess'></a>
## Assess

#### Quality

#### Tidiness

<a id='clean'></a>
## Clean

### Missing Data

##### Define

##### Code

Now that the required information has been collected via the Twitter API, we need to join it with the data in archive_df.

In [None]:
archive_df = archive_df.merge(tweeter_df,on='tweet_id',how='left')
archive_df.head(1)
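
Since some tweet ids may fail to download (they are logged in `tweetIdsFailed.txt`), a left merge can leave missing values for those rows. One possible sanity check, sketched here on small made-up frames rather than the real data, uses the `indicator` and `validate` options of `DataFrame.merge`. The frame names `left` and `right` and their values are illustrative assumptions.

```python
import pandas as pd

# Made-up stand-ins for archive_df and tweeter_df.
left = pd.DataFrame({"tweet_id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame(
    {"tweet_id": [1, 3], "retw_count": [10, 30], "fav_count": [20, 40]}
)

merged = left.merge(
    right,
    on="tweet_id",
    how="left",
    validate="one_to_one",  # raise if tweet_id is duplicated on either side
    indicator=True,         # add a _merge column describing each row's origin
)

# Rows present only in the left frame had no API data.
missing_ids = merged.loc[merged["_merge"] == "left_only", "tweet_id"].tolist()
merged = merged.drop(columns="_merge")
print(missing_ids)
```

Note that integer columns such as `retw_count` become floating point after the merge when any row is unmatched, because the missing entries are stored as NaN.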

##### Test