# <a id='toc1_'></a>[Project: Wrangling and Analyze Data](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Project: Wrangling and Analyze Data](#toc1_)    
  - [Data Gathering](#toc1_1_)    
    - [Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)](#toc1_1_1_)    
    - [Use the Requests library to download the tweet image prediction (image_predictions.tsv)](#toc1_1_2_)    
    - [Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)](#toc1_1_3_)    
  - [Assessing Data](#toc1_2_)    
    - [Quality issues](#toc1_2_1_)    
    - [Tidiness issues](#toc1_2_2_)    
  - [Cleaning Data](#toc1_3_)    
    - [Issue #1:](#toc1_3_1_)    
      - [Define:](#toc1_3_1_1_)    
      - [Code](#toc1_3_1_2_)    
      - [Test](#toc1_3_1_3_)    
    - [Issue #2:](#toc1_3_2_)    
      - [Define](#toc1_3_2_1_)    
      - [Code](#toc1_3_2_2_)    
      - [Test](#toc1_3_2_3_)    
  - [Storing Data](#toc1_4_)    
  - [Analyzing and Visualizing Data](#toc1_5_)    
    - [Insights:](#toc1_5_1_)    
    - [Visualization](#toc1_5_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Data Gathering](#toc0_)
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.


In [None]:
#import all the libraries necessary in this workbook
import pandas as pd
import numpy as np
import requests
import os

### <a id='toc1_1_1_'></a>[Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)](#toc0_)

In [None]:
#Read the archive file that was downloaded manually
df1 = pd.read_csv("../data/twitter-archive-enhanced.csv", sep=",")
df1.head(1);

In [None]:
df1.info()

In [None]:
#Change tweet_id from int to str
df1['tweet_id'] = df1['tweet_id'].astype(str)
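
The notebook does not say why `tweet_id` is converted to a string. One common motivation, offered here as an assumption, is that tweet IDs are identifiers rather than quantities, and 18-digit integers do not survive a round trip through floating point (which pandas falls back to by default when an integer column contains missing values). A minimal illustration, using an ID that appears in this notebook's data:

```python
# Tweet IDs are larger than 2**53, the limit below which every integer
# has an exact float64 representation.
tweet_id = 892420643555336193

as_float = float(tweet_id)      # what happens if the column is coerced to float
round_trip = int(as_float)

print(tweet_id > 2**53)         # the ID lies outside the exactly representable range
print(round_trip == tweet_id)   # the round trip does not preserve the ID

# Stored as text, the ID is preserved character for character.
as_text = str(tweet_id)
print(int(as_text) == tweet_id)
```

Keeping the IDs as strings also makes it harder to run arithmetic on them by accident.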

In [None]:
#Create a list of the tweet ids contained in this dataset
tweet_id_list = df1.tweet_id.tolist()
print(tweet_id_list);

In [None]:
len(tweet_id_list)

### <a id='toc1_1_2_'></a>[Use the Requests library to download the tweet image prediction (image_predictions.tsv)](#toc0_)

In [None]:
#Read the file downloaded
df2 = pd.read_csv("../data/image_predictions.tsv", sep='\t')
df2.head()

In [None]:
df2.info()
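
The heading of this subsection mentions the Requests library, while the cells above only read a local copy of the file. The download step might look like the sketch below. The helper name `download_file` and its `fetch` parameter are hypothetical (they are not part of the notebook), and the file's URL is not given here, so it stays a placeholder.

```python
import os


def download_file(url, dest_path, fetch=None):
    """Download `url` and write the response body to `dest_path`.

    `fetch` is a hypothetical hook that lets the HTTP call be swapped out;
    when it is omitted, `requests.get` is used.
    """
    if fetch is None:
        import requests  # imported lazily so the helper can be exercised offline
        fetch = requests.get

    response = fetch(url)
    response.raise_for_status()  # stop early on 4xx/5xx responses

    # Create the target folder if needed, then write the raw bytes.
    folder = os.path.dirname(dest_path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(dest_path, "wb") as f:
        f.write(response.content)
    return dest_path
```

A call such as `download_file(url, "../data/image_predictions.tsv")`, with `url` set to the address supplied with the project, would produce the file that `pd.read_csv(..., sep='\t')` reads above.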

### <a id='toc1_1_3_'></a>[Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)](#toc0_)

In [None]:
#Getting started on the Twitter API

import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions

import sys
sys.path.append('../code')

import credentials

auth = OAuthHandler(credentials.consumer_key, credentials.consumer_secret)
auth.set_access_token(credentials.access_token, credentials.access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)
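
The cell above only sets up authentication. The query loop that its comment describes (one API call per tweet, with the JSON saved to a text file) could be sketched as follows. The function name `save_tweets_as_json_lines` is hypothetical, and the broad `except` is a simplification chosen because the name of Tweepy's error class differs between library versions.

```python
import json


def save_tweets_as_json_lines(api, tweet_ids, out_path):
    """Fetch each tweet and write one JSON object per line to `out_path`.

    `api` is expected to behave like the `tweepy.API` object created above.
    Returns a dict mapping each failed tweet ID to its error message.
    """
    failures = {}
    with open(out_path, "w", encoding="utf-8") as outfile:
        for tweet_id in tweet_ids:
            try:
                # tweet_mode="extended" asks for the untruncated `full_text`.
                tweet = api.get_status(tweet_id, tweet_mode="extended")
            except Exception as error:  # deleted or protected tweets end up here
                failures[tweet_id] = str(error)
                continue
            # `_json` holds the raw payload that Tweepy parsed for this status.
            json.dump(tweet._json, outfile)
            outfile.write("\n")
    return failures
```

In the notebook this would be called as `save_tweets_as_json_lines(api, tweet_id_list, path_to_tweet_json)`, where the output path is whatever location `tweet_json.txt` is later read from. Recording the failures makes it possible to see which archive tweets no longer exist.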

In [14]:
#Reading json data from the saved txt file
import json

df_list = []

#Relative path, consistent with the other data files loaded above
path = '../data/tweet_json.txt'

with open(path, encoding='utf-8') as f:
    for line in f:
        df_list.append(json.loads(line))

#Printing the first record of the text file
df_list[0]

{'created_at': 'Tue Aug 01 16:23:56 +0000 2017',
 'id': 892420643555336193,
 'id_str': '892420643555336193',
 'full_text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
 'truncated': False,
 'display_text_range': [0, 85],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 892420639486877696,
    'id_str': '892420639486877696',
    'indices': [86, 109],
    'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'url': 'https://t.co/MgUWQ76dJU',
    'display_url': 'pic.twitter.com/MgUWQ76dJU',
    'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
    'type': 'photo',
    'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'medium': {'w': 540, 'h': 528, 'resize': 'fit'},
     'small': {'w': 540, 'h': 528, 'resize': 'fit'},
     'large': {'w': 

In [15]:
#Print the keys of the first record (expected to be the same for every line)
sorted(df_list[0].keys())

['contributors',
 'coordinates',
 'created_at',
 'display_text_range',
 'entities',
 'extended_entities',
 'favorite_count',
 'favorited',
 'full_text',
 'geo',
 'id',
 'id_str',
 'in_reply_to_screen_name',
 'in_reply_to_status_id',
 'in_reply_to_status_id_str',
 'in_reply_to_user_id',
 'in_reply_to_user_id_str',
 'is_quote_status',
 'lang',
 'place',
 'possibly_sensitive',
 'possibly_sensitive_appealable',
 'retweet_count',
 'retweeted',
 'source',
 'truncated',
 'user']

In [18]:
#create dataframe from list of dictionaries
df3=pd.DataFrame(df_list)

In [19]:
#select only the columns we want to extract and store them in a dataframe (df3)
df3=df3[['id_str','favorite_count','retweet_count']]
df3

| | id_str | favorite_count | retweet_count |
|---|---|---|---|
| 0 | 892420643555336193 | 33626 | 6953 |
| 1 | 892177421306343426 | 29162 | 5255 |
| 2 | 891815181378084864 | 21940 | 3462 |
| 3 | 891689557279858688 | 36701 | 7171 |
| 4 | 891327558926688256 | 35098 | 7700 |
| ... | ... | ... | ... |
| 2023 | 671485057807351808 | 673 | 195 |
| 2024 | 671390180817915904 | 1274 | 647 |
| 2025 | 671362598324076544 | 974 | 257 |
| 2026 | 671357843010908160 | 351 | 133 |


> [!IMPORTANT]
> After gathering the data and saving it to files, I changed some code cells to markdown.


## <a id='toc1_2_'></a>[Assessing Data](#toc0_)
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and
programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
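
As a rough illustration of programmatic assessment, the sketch below gathers a few common checks into one summary. The toy DataFrame, its column names other than `tweet_id`, and the helper `assess` are assumptions made for the example and do not describe the real archive.

```python
import pandas as pd

# Toy stand-in for the archive; the values are invented for illustration.
toy = pd.DataFrame({
    "tweet_id": ["1", "2", "2", "3"],
    "name": ["Phineas", None, None, "a"],
    "rating_denominator": [10, 10, 10, 0],
})


def assess(df, id_column):
    """Collect a few programmatic checks into one summary dict."""
    return {
        "rows": len(df),
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicated_ids": int(df[id_column].duplicated().sum()),
    }


summary = assess(toy, "tweet_id")
print(summary)
```

Checks like these complement visual assessment: scrolling through the table tends to reveal odd values (a name such as "a", a denominator of 0), while the counts show how widespread missing or duplicated entries are.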



### <a id='toc1_2_1_'></a>[Quality issues](#toc0_)
1.

2.

3.

4.

5.

6.

7.

8.

### <a id='toc1_2_2_'></a>[Tidiness issues](#toc0_)
1.

2.

## <a id='toc1_3_'></a>[Cleaning Data](#toc0_)
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
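
The copy-then-merge step described in the note could be sketched as follows. The three toy frames stand in for `df1`, `df2` and `df3`; the columns `text` and `jpg_url` and all values are assumptions made for the example, and the choice of inner joins is one possible way to keep only tweets that have image predictions.

```python
import pandas as pd

# Toy stand-ins for df1 (archive), df2 (image predictions) and df3 (API counts).
archive = pd.DataFrame({"tweet_id": ["1", "2", "3"], "text": ["a", "b", "c"]})
images = pd.DataFrame({"tweet_id": ["1", "3"], "jpg_url": ["u1", "u3"]})
counts = pd.DataFrame({"id_str": ["1", "2", "3"],
                       "favorite_count": [10, 20, 30],
                       "retweet_count": [1, 2, 3]})

# Work on copies so the gathered data stays untouched.
archive_clean = archive.copy()
images_clean = images.copy()
counts_clean = counts.copy()

# Align the key name and type before merging.
counts_clean = counts_clean.rename(columns={"id_str": "tweet_id"})
for frame in (archive_clean, images_clean, counts_clean):
    frame["tweet_id"] = frame["tweet_id"].astype(str)

# Inner joins keep only tweets present in all three sources,
# which also drops tweets without an image prediction.
master = (archive_clean
          .merge(images_clean, on="tweet_id", how="inner")
          .merge(counts_clean, on="tweet_id", how="inner"))
print(master)
```

Merging on a key with the same name and dtype in every frame avoids silent mismatches, such as an integer `tweet_id` failing to match a string `id_str`.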

In [None]:
# Make copies of original pieces of data


### <a id='toc1_3_1_'></a>[Issue #1:](#toc0_)

#### <a id='toc1_3_1_1_'></a>[Define:](#toc0_)

#### <a id='toc1_3_1_2_'></a>[Code](#toc0_)

#### <a id='toc1_3_1_3_'></a>[Test](#toc0_)

### <a id='toc1_3_2_'></a>[Issue #2:](#toc0_)

#### <a id='toc1_3_2_1_'></a>[Define](#toc0_)

#### <a id='toc1_3_2_2_'></a>[Code](#toc0_)

#### <a id='toc1_3_2_3_'></a>[Test](#toc0_)

## <a id='toc1_4_'></a>[Storing Data](#toc0_)
Save the gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
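
A minimal sketch of the save step, assuming the cleaned data lives in a DataFrame called `master` (a hypothetical name). The single row reuses values shown earlier in this notebook, and a temporary folder is used so the example does not touch the project's data directory.

```python
import os
import tempfile

import pandas as pd

# Toy master frame; in the notebook this would be the cleaned, merged DataFrame.
master = pd.DataFrame({"tweet_id": ["892420643555336193"],
                       "favorite_count": [33626],
                       "retweet_count": [6953]})

# A temporary folder is used here; the notebook would choose its own location.
out_path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")

# index=False avoids writing the row index as an extra unnamed column.
master.to_csv(out_path, index=False)

# Reading tweet_id back as text keeps the long IDs intact.
reloaded = pd.read_csv(out_path, dtype={"tweet_id": str})
print(reloaded.dtypes)
```

Passing `dtype={"tweet_id": str}` when reloading keeps the earlier int-to-str conversion from being undone by the CSV round trip.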

## <a id='toc1_5_'></a>[Analyzing and Visualizing Data](#toc0_)
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**
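
As one possible starting point for an insight, the sketch below compares likes and retweets using the first five rows of the API counts shown earlier in this notebook. Five rows are far too few to support a conclusion; the aim is only to show the shape of the calculation, which would be run on the full cleaned dataset.

```python
import pandas as pd

# First five rows of the API counts shown earlier in this notebook.
sample = pd.DataFrame({
    "favorite_count": [33626, 29162, 21940, 36701, 35098],
    "retweet_count": [6953, 5255, 3462, 7171, 7700],
})

# Pearson correlation between likes and retweets.
correlation = sample["favorite_count"].corr(sample["retweet_count"])

# Overall number of likes per retweet across the sample.
likes_per_retweet = sample["favorite_count"].sum() / sample["retweet_count"].sum()

print(f"correlation: {correlation:.2f}")
print(f"likes per retweet: {likes_per_retweet:.2f}")
```

With matplotlib installed, `sample.plot.scatter(x="retweet_count", y="favorite_count")` would turn the same comparison into a visualization.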

### <a id='toc1_5_1_'></a>[Insights:](#toc0_)
1.

2.

3.

### <a id='toc1_5_2_'></a>[Visualization](#toc0_)