# Project: Wrangling and Analysing Data from Twitter - WeRateDogs


## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gathering">Data Gathering</a></li>
<li><a href="#assessing">Assessing Data</a></li>
<li><a href="#cleaning">Cleaning Data</a></li>
<li><a href="#analysis">Analysing Data</a></li>
</ul>

<a id='intro'></a>
## Introduction

For this project, completed as part of Udacity's [Data Analyst Nanodegree](https://www.udacity.com/course/data-analyst-nanodegree--nd002) program, we use Python and its libraries to gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, and then clean it. We then perform some data analysis and visualisations.

Details about the data used can be found [here](https://classroom.udacity.com/nanodegrees/nd002/parts/af503f34-9646-4795-a916-190ebc82cb4a/modules/14d9f5f1-9e7b-4bfb-97f3-bcdbf4a3699c/lessons/a8085857-3e28-4fc7-aeb8-da64ccbc2e20/concepts/5e3db54a-1a5f-41a6-8e20-fd99f201861d).

<a id='gathering'></a>
## Data Gathering

The data gathering process results in the following three files:
* **twitter-archive-enhanced.csv**: this file is provided by Udacity and can be downloaded manually from the 'resources' section of the project part of the course.

* **image-predictions.tsv**: this file is hosted on Udacity's servers and is downloaded programmatically using the Requests library from [this](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv) URL.

* **tweet-json.txt**: this file can be produced by querying the Twitter API for each tweet's JSON data using Python's Tweepy library and storing each tweet's entire set of JSON data, one tweet per line. Because we encountered some technical challenges setting up a developer account on Twitter, we use the version provided by the instructor, which can be downloaded manually from the course's project page.
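
Although the instructor's copy of the file is used here, the gathering step described in the last bullet could look roughly like the sketch below. It is not the code used in this project: `save_tweets_as_json_lines` and its `fetch_status` parameter are hypothetical names, and the Tweepy calls in the trailing comment are an assumption that depends on the installed Tweepy version and on having valid API credentials.

```python
import json


def save_tweets_as_json_lines(tweet_ids, fetch_status, path):
    """Write one JSON object per line for every tweet that could be fetched.

    fetch_status is a hypothetical callable mapping a tweet id to a dict
    holding that tweet's JSON data. Returns the ids whose fetch failed.
    """
    failed_ids = []
    with open(path, 'w', encoding='utf-8') as out:
        for tweet_id in tweet_ids:
            try:
                status = fetch_status(tweet_id)
            except Exception:
                # e.g. deleted or protected tweets: record the id and move on
                failed_ids.append(tweet_id)
                continue
            out.write(json.dumps(status) + '\n')
    return failed_ids


# With Tweepy, fetch_status could (as an assumption) be built like this:
#   auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
#   auth.set_access_token(access_token, access_secret)
#   api = tweepy.API(auth, wait_on_rate_limit=True)
#   fetch_status = lambda tweet_id: api.get_status(tweet_id, tweet_mode='extended')._json
```

Passing the fetcher in as a parameter keeps the file-writing logic independent of the API client, so it can be tried out with a stand-in function before any credentials are available.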

In [1]:
### Importing needed libraries
import pandas as pd
import numpy as np
import requests
import json
import matplotlib.pyplot as plt
%matplotlib inline

### 1. Reading archive data into a dataframe

In [2]:
### Reading the 'twitter-archive-enhanced.csv' file into a dataframe
tweet_archive = pd.read_csv('twitter-archive-enhanced.csv')
tweet_archive.head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,


### 2. Downloading image predictions data and loading it into a dataframe

In [3]:
### Downloading 'image-predictions.tsv' file programmatically
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
resp = requests.get(url)
resp.raise_for_status()  # stop here if the download failed instead of saving an error page
with open('image-predictions.tsv', mode='wb') as file:
    file.write(resp.content)

### Reading the file into a dataframe
img_predictions = pd.read_csv('image-predictions.tsv', sep='\t')
img_predictions.head(3)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
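
When the notebook is re-run often, the download step can be wrapped so that the file is fetched only once. The following is a minimal sketch under stated assumptions: `download_if_missing` and its `get` parameter are hypothetical names introduced for illustration, with `get` defaulting to `requests.get`.

```python
import os


def download_if_missing(url, path, get=None):
    """Download url to path unless path already exists.

    get is a hypothetical injection point that defaults to requests.get.
    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(path):
        return False
    if get is None:
        import requests  # imported lazily so the helper can be tried without network access
        get = requests.get
    resp = get(url)
    resp.raise_for_status()  # raise on HTTP errors rather than writing an error page to disk
    with open(path, 'wb') as file:
        file.write(resp.content)
    return True
```

A call such as `download_if_missing(url, 'image-predictions.tsv')` would then replace the explicit `requests.get` and `open` calls in the cell above.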


### 3. Reading JSON data into a dataframe

In [4]:
### Extracting data of interest from 'tweet-json.txt' file

list_dict_interest_data = []

### Loading JSON data (one tweet per line)
with open('tweet-json.txt', encoding='utf-8') as json_data:
    for line in json_data:
        data = json.loads(line)

        ### Adding needed data to the list
        list_dict_interest_data.append({
            'tweet_id': data['id_str'],
            'retweet_count': data['retweet_count'],
            'favorite_count': data['favorite_count'],
            'full_text': data['full_text'],
            'display_text_range': data['display_text_range'],
        })

tweets_data = pd.DataFrame(list_dict_interest_data,
                           columns=['tweet_id', 'retweet_count', 'favorite_count',
                                    'full_text', 'display_text_range'])
tweets_data.head(3)

Unnamed: 0,tweet_id,retweet_count,favorite_count,full_text,display_text_range
0,892420643555336193,8853,39467,This is Phineas. He's a mystical boy. Only eve...,"[0, 85]"
1,892177421306343426,6514,33819,This is Tilly. She's just checking pup on you....,"[0, 138]"
2,891815181378084864,4328,25461,This is Archie. He is a rare Norwegian Pouncin...,"[0, 121]"
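
The loop above assumes every line is valid JSON and contains every key. If that cannot be guaranteed, a more defensive variant might look like the sketch below. `parse_tweet_lines` and `FIELDS` are hypothetical names, and the sample input is made up purely for illustration; it is not taken from 'tweet-json.txt'.

```python
import io
import json

# Fields copied from each tweet in addition to its id
FIELDS = ['retweet_count', 'favorite_count', 'full_text', 'display_text_range']


def parse_tweet_lines(lines, fields=FIELDS):
    """Parse line-delimited tweet JSON into a list of flat dicts.

    Blank lines are ignored, lines that are not valid JSON are counted
    and skipped, and missing keys become None instead of raising KeyError.
    """
    records, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        record = {'tweet_id': data.get('id_str')}
        for field in fields:
            record[field] = data.get(field)
        records.append(record)
    return records, skipped


# Made-up sample: one valid tweet, one blank line, one malformed line
sample = io.StringIO(
    '{"id_str": "1", "retweet_count": 5, "favorite_count": 9, '
    '"full_text": "hi", "display_text_range": [0, 2]}\n'
    '\n'
    'not json\n'
)
records, skipped = parse_tweet_lines(sample)
print(len(records), skipped)  # prints "1 1"
```

The resulting list can be passed to `pd.DataFrame(records)` in the same way as `list_dict_interest_data`. Note that `id_str` is a string, whereas `pd.read_csv` typically parses the `tweet_id` column as an integer, so the types may need aligning before the tables are combined.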
