# Project: Wrangle and Analyze Data

In [1]:
import pandas as pd
import numpy as np
import requests
import matplotlib.pyplot as plt
%matplotlib inline
import json

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [2]:
twt_archive = pd.read_csv('twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [3]:
URL = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"

In [4]:
# Read the file
ImagePrediction = pd.read_csv('image-predictions.tsv', sep = '\t')
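
The cell above reads a local copy of `image-predictions.tsv`, but the download step itself is not shown. A minimal sketch of that step with the Requests library might look like the following, assuming the URL is still reachable. `local_filename` and `download_file` are hypothetical helper names and are not part of the original notebook.

```python
import os
from urllib.parse import urlparse


def local_filename(url):
    """Return the last path component of a URL, ignoring stray whitespace."""
    return os.path.basename(urlparse(url.strip()).path)


def download_file(url, folder="."):
    """Download url into folder and return the path of the saved file."""
    import requests  # imported here so local_filename() works without it

    response = requests.get(url.strip(), timeout=30)
    response.raise_for_status()  # stop early on HTTP errors such as 404
    path = os.path.join(folder, local_filename(url))
    with open(path, "wb") as handle:
        handle.write(response.content)
    return path


URL = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
print(local_filename(URL))
# download_file(URL)  # uncomment to fetch the file (needs network access)
```

Stripping whitespace before the request guards against a stray space in the URL string, which would otherwise make the request fail.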

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [5]:
import tweepy
# URL of the tweet JSON file provided by Udacity (downloaded with the Requests library)
url = "https://video.udacity-data.com/topher/2018/November/5be5fb7d_tweet-json/tweet-json.txt"

In [11]:
# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)
# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
#tweet_ids = df_1.tweet_id.values
#len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
#count = 0
#fails_dict = {}
#start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
#with open('tweet_json.txt', 'w') as outfile:
#    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
#    for tweet_id in tweet_ids:
#        count += 1
#        print(str(count) + ": " + str(tweet_id))
#        try:
#            tweet = api.get_status(tweet_id, tweet_mode='extended')
#            print("Success")
#            json.dump(tweet._json, outfile)
#            outfile.write('\n')
#        except tweepy.TweepError as e:
#            print("Fail")
#            fails_dict[tweet_id] = e
#            pass
#end = timer()
#print(end - start)
#print(fails_dict)

In [None]:
# I will be using the Udacity tweet JSON file
myList = []

with open('tweet-json.txt') as file:
    # Each line of the file holds the JSON of one tweet
    for line in file:
        tweet = json.loads(line)
        myList.append({'tweet_id': tweet['id'],
                       'retweet_counting': tweet['retweet_count'],
                       'favorite_counting': tweet['favorite_count']})


In [None]:
# Build a DataFrame from the list of dictionaries
twtAPI = pd.DataFrame(myList, columns = ['tweet_id', 'retweet_counting', 'favorite_counting'])
twtAPI.head()
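
As an alternative to the manual loop, pandas can read a file with one JSON object per line directly. The sketch below is only an illustration on a tiny made-up sample, not the notebook's method; `load_tweet_counts` is a hypothetical helper, and it keeps the original field names apart from renaming `id` to `tweet_id`. With the real file it would be worth checking that the large tweet IDs survive the parse unchanged.

```python
import json
import os
import tempfile

import pandas as pd


def load_tweet_counts(path):
    """Read a JSON-lines file and keep the id and engagement counts."""
    raw = pd.read_json(path, lines=True)
    counts = raw[["id", "retweet_count", "favorite_count"]]
    return counts.rename(columns={"id": "tweet_id"})


# Tiny made-up sample standing in for tweet-json.txt
sample = [
    {"id": 1001, "retweet_count": 5, "favorite_count": 20, "lang": "en"},
    {"id": 1002, "retweet_count": 0, "favorite_count": 3, "lang": "en"},
]
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")
    with open(sample_path, "w") as handle:
        for record in sample:
            handle.write(json.dumps(record) + "\n")
    sample_counts = load_tweet_counts(sample_path)

print(sample_counts)
```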

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



## 1. Twitter Archive dataframe

In [None]:
twt_archive.head()

In [None]:
twt_archive.info()

In [None]:
twt_archive.duplicated().sum()

In [None]:
twt_archive.describe()

In [None]:
twt_archive.isnull().sum()

In [None]:
twt_archive.rating_numerator.value_counts()

In [None]:
twt_archive.rating_denominator.value_counts()

In [None]:
# Look for rows where rating_denominator equals 0
twt_archive[twt_archive['rating_denominator'] == 0]

## 2. Image Prediction dataframe

In [None]:
ImagePrediction.head(5)

In [None]:
ImagePrediction.info()

In [None]:
ImagePrediction.duplicated().sum()

In [None]:
ImagePrediction.isnull().sum()

## 3. Twitter API dataframe

In [None]:
twtAPI.head()

In [None]:
twtAPI.info()

In [None]:
twtAPI.isnull().sum()
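
The same three checks (`info`, `duplicated`, `isnull`) are repeated for each dataframe. A small helper could gather the key figures in one place. This is only a sketch on a made-up frame; `assess` is a hypothetical name that does not appear in the notebook.

```python
import pandas as pd


def assess(df):
    """Collect a few programmatic assessment figures in one dictionary."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_per_column": df.isnull().sum().to_dict(),
    }


# Made-up frame with one duplicate row and one missing value
toy = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "name": ["Rex", "Bo", "Bo", None],
})
summary = assess(toy)
print(summary)
```

Calling `assess(twt_archive)`, `assess(ImagePrediction)` and `assess(twtAPI)` would then give comparable summaries for the three tables.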

## Observation Summary:

**1. Twitter Archive dataframe**

`Quality issues :`

1. Two columns have the wrong data type: `timestamp` and `retweeted_status_timestamp` are stored as `object` and should be converted to `datetime`.

2. Many values are missing in several columns, such as `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp` and `expanded_urls`.

3. `tweet_id` has the wrong data type: it should be converted from integer to `object` (string).

4. The rating denominator should almost always be 10, but there are numerous other unusual values, such as 110, 50 and 170.

5. There are numerous NaN values in the `expanded_urls` column.

6. Some of the tweets do not include an image.

7. The prediction names need to be capitalized.

8. The `source` column contains long strings that should be replaced with something more readable.

9. Some tweets appear to be just retweets or replies.

`Tidiness issues : `
- The dog stages (doggo, floofer, pupper, puppo) are spread across multiple columns and should be combined into one column.

**2. Image Prediction dataframe**

`Quality issues :`
- Images are missing: there are only 2057 images for 2356 tweets.
- Wrong data type for `tweet_id`.

`Tidiness issues : `
- Should be combined with the Twitter archive dataframe.

**3. Twitter API dataframe**

`Quality issues :`
- Wrong data type for `tweet_id`.
- Missing tweets.

`Tidiness issues : `
- Should be combined with the Twitter archive dataframe.
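
The "missing images" and "missing tweets" observations can also be checked programmatically by comparing the `tweet_id` values of two tables. The sketch below uses made-up IDs; `missing_ids` is a hypothetical helper, not something defined in the notebook.

```python
import pandas as pd


def missing_ids(reference, other, key="tweet_id"):
    """Return the key values present in reference but absent from other."""
    mask = ~reference[key].isin(other[key])
    return reference.loc[mask, key].tolist()


archive_toy = pd.DataFrame({"tweet_id": [10, 11, 12, 13]})
images_toy = pd.DataFrame({"tweet_id": [10, 12]})
print(missing_ids(archive_toy, images_toy))
```

Applied to the real data, `len(missing_ids(twt_archive, ImagePrediction))` would count the archive tweets that have no image prediction.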

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [None]:
# Make copies of original pieces of data
twt_archive_clean = twt_archive.copy()
ImagePrediction_clean = ImagePrediction.copy()
twtAPI_clean = twtAPI.copy()

In [None]:
print(twt_archive_clean.shape)
print(ImagePrediction_clean.shape)
print(twtAPI_clean.shape)

### Tidiness issues : 

### 1. Multiple dog stage columns
**Define:** The dog stages (doggo, floofer, pupper, puppo) are spread across multiple columns, so they will be combined into one column.

In [None]:
twt_archive_clean.loc[(twt_archive_clean[['doggo', 'floofer', 'pupper', 'puppo']] != 'None'
                 ).sum(axis=1) > 1]

In [None]:
# Merge the different dog stage columns into one column
twt_archive_clean['dog_stages'] = twt_archive_clean[['puppo', 'pupper', 'floofer', 'doggo']].apply(
    lambda x: ','.join(x.astype(str)),axis=1)

# regex=False treats the pattern as literal text, so the result is the same across pandas versions
twt_archive_clean['dog_stages'] = twt_archive_clean['dog_stages'].str.replace(',None', '', regex=False)

twt_archive_clean['dog_stages'] = twt_archive_clean['dog_stages'].str.replace('None,', '', regex=False)

twt_archive_clean.drop(['puppo','pupper','floofer','doggo'], axis=1, inplace=True)

In [None]:
twt_archive_clean.dog_stages.value_counts()
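
The join-then-replace approach above depends on removing the `'None'` placeholders in the right order. A possible alternative, shown here only as a sketch on made-up rows, filters the placeholders out before joining. `combine_stages` is a hypothetical helper; note that the order of the combined names follows the column order passed in, so it can differ from the notebook's output (for example `doggo,pupper` instead of `pupper,doggo`).

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(df, stage_cols=STAGES):
    """Join the stage names found in each row, or return 'None' if there are none."""
    def join_row(row):
        found = [value for value in row if value != "None"]
        return ",".join(found) if found else "None"

    return df[stage_cols].apply(join_row, axis=1)


toy = pd.DataFrame({
    "doggo": ["None", "doggo", "doggo", "None"],
    "floofer": ["None", "None", "None", "None"],
    "pupper": ["pupper", "None", "pupper", "None"],
    "puppo": ["None", "None", "None", "None"],
})
toy["dog_stages"] = combine_stages(toy)
print(toy["dog_stages"].tolist())
```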

### 2. Multiple dataframes
**Define:** All three dataframes will be combined into one.

In [None]:
twt_archive_clean = pd.merge(twt_archive_clean,  twtAPI_clean,
                            on = ['tweet_id'], how = 'left')

In [None]:
twt_archive_clean = pd.merge(twt_archive_clean, ImagePrediction_clean,
                            on = ['tweet_id'], how = 'left')

In [None]:
#Test 
twt_archive_clean.head(2)

In [None]:
#Test 
twt_archive_clean.info()
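
Besides `head()` and `info()`, the merge itself can report how many rows found a match. The sketch below uses made-up frames and two optional `pd.merge` parameters: `validate` raises an error if `tweet_id` is not unique on both sides, and `indicator` adds a `_merge` column that records where each row came from. Whether the real tables satisfy a one-to-one relationship is an assumption to verify.

```python
import pandas as pd

archive_toy = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
api_toy = pd.DataFrame({"tweet_id": [1, 2], "retweet_counting": [7, 9]})

merged = pd.merge(
    archive_toy,
    api_toy,
    on="tweet_id",
    how="left",
    validate="one_to_one",  # fail loudly if tweet_id is duplicated on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged)
print("rows without API data:", unmatched)
```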

### Quality issues : 

### 1.
**Define:** All replies and retweets will be removed.

In [None]:
twt_archive_clean = twt_archive_clean[pd.isnull(twt_archive_clean.retweeted_status_id)]
twt_archive_clean= twt_archive_clean[pd.isnull(twt_archive_clean.in_reply_to_status_id)]
twt_archive_clean.info()

### 2.
**Define:** The data type of `timestamp` and `retweeted_status_timestamp` will be converted from object to datetime.

In [None]:
twt_archive_clean['timestamp']= pd.to_datetime(twt_archive_clean['timestamp'])
twt_archive_clean['retweeted_status_timestamp']= pd.to_datetime(twt_archive_clean['retweeted_status_timestamp'])
#Test
twt_archive_clean.info()

In [None]:
#Test
twt_archive_clean.timestamp.head(3)
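
Once a column has a datetime type, the `.dt` accessor becomes available, which is the practical benefit of this conversion. The sketch below assumes the timestamps look like `2017-08-01 16:23:56 +0000`; the sample values are illustrative and the exact format in the archive should be confirmed.

```python
import pandas as pd

# Illustrative values in the assumed "date time +offset" format
raw = pd.Series(["2017-08-01 16:23:56 +0000", "2015-11-15 22:32:08 +0000"])
stamps = pd.to_datetime(raw, utc=True)

print(stamps.dt.year.tolist())
print(stamps.dt.day_name().tolist())
```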

### 3.
**Define:** Several columns consist mostly of missing values (in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls). The reply and retweet columns will be removed here; expanded_urls is handled in issue 7.


In [None]:
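# Use the columns keyword: passing the axis positionally is not supported in recent pandas versions
twt_archive_clean = twt_archive_clean.drop(columns=['in_reply_to_user_id', 'retweeted_status_user_id',
                                                    'retweeted_status_timestamp', 'retweeted_status_id',
                                                    'in_reply_to_status_id'])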

twt_archive_clean.columns


### 4.
**Define:** The data type of `tweet_id` will be converted from integer to object (string).

In [None]:
twt_archive_clean['tweet_id'] = twt_archive_clean['tweet_id'].astype(str)

In [None]:
twt_archive_clean.info()

### 5. 
**Define:** The rating denominator should almost always be 10, but there are numerous other unusual values, such as 110, 50 and 170.

In [None]:
# Inspect the rows whose denominator is not 10, as described in the Define step
a = twt_archive_clean[twt_archive_clean['rating_denominator'] != 10]
a
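
The Define step concerns denominators other than 10, so one way to make the issue concrete is to list those rows and then decide what to do with them. The sketch below uses made-up ratings; `unusual_denominators` is a hypothetical helper, and keeping only the tweets rated out of 10 is one possible cleaning choice rather than the notebook's method.

```python
import pandas as pd


def unusual_denominators(df, expected=10):
    """Return the rows whose rating_denominator differs from expected."""
    return df[df["rating_denominator"] != expected]


ratings_toy = pd.DataFrame({
    "tweet_id": ["1", "2", "3", "4"],
    "rating_numerator": [13, 84, 12, 9],
    "rating_denominator": [10, 70, 10, 11],
})
odd = unusual_denominators(ratings_toy)
print(odd)

# One possible cleaning choice (an assumption, not the notebook's method):
# keep only the tweets rated out of 10.
kept = ratings_toy[ratings_toy["rating_denominator"] == 10]
print(len(kept), "rows kept")
```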

### 6.
**Define:** Replace the long source strings with something more readable.



In [None]:
# List the distinct sources
twt_archive_clean.source.value_counts()

In [None]:
twt_archive_clean.source = twt_archive_clean.source.replace({'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>':'Twitter for iPhone',
                                                                     '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>':'Vine - Make a Scene',
                                                                     '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>': 'Twitter Web Client',
                                                                     '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>': 'TweetDeck'})


In [None]:
#Test 
twt_archive_clean.source.value_counts()
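
The dictionary above has to list every distinct source string. Since each value is an HTML anchor tag, a regular expression can pull out the visible text instead. This is a sketch of that alternative on two of the strings from the mapping, not the approach the notebook uses.

```python
import pandas as pd

sources = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
])
# Capture the text between the opening tag and the closing </a>
readable = sources.str.extract(r">([^<]+)</a>", expand=False)
print(readable.tolist())
```

A value that does not match the pattern becomes NaN, so it is worth checking `readable.isnull().sum()` before overwriting the column.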

### 7. 
**Define:** Rows with NaN values in the `expanded_urls` column will be removed.




In [None]:
twt_archive_clean[twt_archive_clean['expanded_urls'].isnull()]

In [None]:
twt_archive_clean.dropna(subset =['expanded_urls'],inplace=True)
twt_archive_clean.expanded_urls.isnull().sum()

### 8.
**Define:** Some tweets do not have an image; tweets that do not include images will be deleted.



In [None]:
twt_archive_clean.info()

In [None]:
twt_archive_clean = twt_archive_clean[twt_archive_clean.jpg_url.notnull()]

In [None]:
#Test
twt_archive_clean.info()


### 9. 
**Define:** Capitalize the prediction names (p1, p2, p3).

In [None]:
twt_archive_clean.p1 =twt_archive_clean.p1.str.title()
twt_archive_clean.p2 =twt_archive_clean.p2.str.title()
twt_archive_clean.p3 =twt_archive_clean.p3.str.title()

In [None]:
#Test
twt_archive_clean.head(2)
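
`str.title()` capitalizes the first letter of every word and leaves any separators in place. If the prediction names use underscores between words (an assumption about the data, with made-up sample values below), the underscores could be swapped for spaces in the same step:

```python
import pandas as pd

# Made-up prediction names; the underscore separator is an assumption
predictions = pd.Series(["golden_retriever", "pug", "Labrador_retriever"])
tidy = predictions.str.replace("_", " ", regex=False).str.title()
print(tidy.tolist())
```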

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [None]:
# index=False keeps the row index from being written as an extra unnamed column
twt_archive_clean.to_csv('twitter_archive_master.csv', index=False)
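
A CSV file does not store data types, so the string `tweet_id` values would be parsed back as integers when the file is reloaded. The sketch below shows a round trip on a made-up frame using an in-memory buffer instead of a real file; passing `index=False` and `dtype` are choices made for this illustration.

```python
import io

import pandas as pd

# Made-up IDs and counts, only for illustration
master_toy = pd.DataFrame({
    "tweet_id": ["100000000000000001", "100000000000000002"],
    "favorite_counting": [120, 45],
})

buffer = io.StringIO()
master_toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without dtype, pandas would parse tweet_id back into integers
restored = pd.read_csv(buffer, dtype={"tweet_id": str})
print(restored.dtypes)
```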

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Visualization

In [None]:
b=twt_archive_clean.source.value_counts()
b

In [None]:
v1 = b.plot.bar(color = 'green', fontsize = 10)


plt.title('Common Twitter sources', color = 'black', fontsize = 15)
plt.xlabel('Source', color = 'black', fontsize = 10)
plt.ylabel('Count of tweets', color = 'black', fontsize = 10);

In [None]:
a=twt_archive_clean.dog_stages.value_counts()
a

In [None]:
v2 = a.plot.bar(color = 'blue', fontsize = 10)


plt.title('Common dog stages', color = 'black', fontsize = '15')
plt.xlabel('Dog Stage', color = 'black', fontsize = '10')
plt.ylabel('Counts', color = 'black', fontsize = '10');

In [None]:
# Take the labels from the index of the value counts so each wedge matches its own count
plt.pie(a, labels = a.index, radius = 1.40)
plt.legend(title = "Dog Stages")
plt.show()


### Insights:
1. The first chart shows that the most commonly used source is Twitter for iPhone, with 1932 tweets.

2. The first chart shows that the least used source is TweetDeck, with 11 tweets.

3. The first chart shows significant gaps between the sources, which suggests a clear preference for using Twitter on a phone.

4. The dog stage charts show that the most common dog stage is pupper, with 201 tweets.

5. The dog stage charts show that the least common dog stage is floofer, with 7 tweets.

6. From the pie chart we can see that some tweets do not include dog stage information.
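
The counts quoted in the insights can be complemented with shares, which are easier to compare across categories. A minimal sketch on made-up stage values, using `value_counts(normalize=True)`:

```python
import pandas as pd

# Made-up stage values, only to show the calculation
stages = pd.Series(["None", "None", "None", "pupper", "pupper", "doggo"])
counts = stages.value_counts()
shares = (stages.value_counts(normalize=True) * 100).round(1)
print(counts)
print(shares)
```

On the cleaned data the same two lines could be applied to `twt_archive_clean.dog_stages` or `twt_archive_clean.source`.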