# Data Wrangling project

Your tasks in this project are as follows:

Data wrangling, which consists of:
- Gathering data
- Assessing data
- Cleaning data
- Storing, analyzing, and visualizing your wrangled data

Reporting on:
1. your data wrangling efforts
2. your data analyses and visualizations

## Gathering Data for this Project

Gather each of the three pieces of data as described below in a Jupyter Notebook titled wrangle_act.ipynb:

The WeRateDogs Twitter archive. I am giving this file to you, so imagine it as a file on hand. Download this file manually.

The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.

**Note: do not include your Twitter API keys, secrets, and tokens in your project submission.**


In [1]:
# Put these at the top of every notebook, to get automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
def pv(variable):
    if(len(variable)>1): print(variable, ":", eval(variable))

In [2]:
### Read The WeRateDogs Twitter archive


In [3]:
!file twitter-archive-enhanced.csv
!head -c200 twitter-archive-enhanced.csv
!tail -c200 twitter-archive-enhanced.csv

twitter-archive-enhanced.csv: HTML document, UTF-8 Unicode text, with very long lines
tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name, Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj,,,,https://twitter.com/dog_rates/status/666020888022790149/photo/1,8,10,None,None,None,None,None


In [4]:
import pandas as pd
df_twitter_archive_enhanced = pd.read_csv('twitter-archive-enhanced.csv')
df_twitter_archive_enhanced.head(3)
df_twitter_archive_enhanced.shape

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,


(2356, 17)

In [12]:
df_twitter_archive_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

## Read Image prediction

In [6]:
import os

import pandas as pd
import requests

folder_name = "./"
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
pred_filename = url.split('/')[-1]
response = requests.get(url)
response.raise_for_status()  # Stop here if the download failed
# Save the raw bytes so the file on disk matches what the server sent
with open(os.path.join(folder_name, pred_filename), mode='wb') as file:
    file.write(response.content)
print("File retrieved")
df_pred = pd.read_csv(os.path.join(folder_name, pred_filename), sep="\t")
df_pred.head(3)

335079

File retrieved


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True


### Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
We will call the Twitter API for every Tweet we have in df_twitter_archive_enhanced.tweet_id and add the information to our existing Pandas dataset.



In [None]:
#!pip install tweepy

In [7]:
'''
Create a config file to store the credentials.
'''
# import configparser
# config = configparser.ConfigParser()
# config['TWITTER'] = {'consumer_key': '45',
#                     'consumer_secret': 'yes',
#                     'access_token': 'yes',
#                     'access_secret': '9'}
# with open('config.ini', 'w') as configfile:
#     config.write(configfile)


'\nCreate a config file to store the credentials.\n'

In [11]:
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer
from tqdm import tqdm # For progress bar
import configparser

# We will read them from a file that won't be committed.
config = configparser.ConfigParser()
_ = config.read('config.ini');
consumer_key = config['TWITTER']['consumer_key']
consumer_secret = config['TWITTER']['consumer_secret']
access_token = config['TWITTER']['access_token']
access_secret = config['TWITTER']['access_secret']

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

tweet_ids = df_twitter_archive_enhanced.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json-test.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tqdm(tweet_ids):
        count += 1
        #print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            #print("Success")
            _ = json.dump(tweet._json, outfile)
            _ = outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)

2356

  1%|          | 20/2356 [00:05<09:33,  4.07it/s]

Fail


  4%|▍         | 96/2356 [00:24<09:12,  4.09it/s]

Fail


  4%|▍         | 102/2356 [00:25<09:11,  4.09it/s]

Fail


  4%|▍         | 105/2356 [00:26<09:13,  4.07it/s]

Fail


  5%|▌         | 119/2356 [00:30<09:09,  4.07it/s]

Fail


  6%|▌         | 133/2356 [00:33<09:09,  4.05it/s]

Fail


  7%|▋         | 156/2356 [00:39<08:49,  4.15it/s]

Fail


 11%|█         | 248/2356 [01:02<08:26,  4.17it/s]

Fail


 11%|█         | 261/2356 [01:06<08:19,  4.20it/s]

Fail


 13%|█▎        | 299/2356 [01:16<08:28,  4.04it/s]

Fail


 16%|█▋        | 383/2356 [01:38<08:11,  4.01it/s]

Fail


 22%|██▏       | 507/2356 [02:10<07:30,  4.11it/s]

Fail


 24%|██▍       | 567/2356 [02:25<07:22,  4.04it/s]

Fail


 33%|███▎      | 785/2356 [03:21<06:23,  4.10it/s]

Fail


 35%|███▍      | 819/2356 [03:31<06:26,  3.98it/s]

Fail


 40%|███▉      | 933/2356 [13:44<05:49,  4.07it/s]    

Fail


 73%|███████▎  | 1727/2356 [17:06<02:34,  4.08it/s]

Fail


100%|██████████| 2356/2356 [31:02<00:00,  4.01it/s]    

1862.885395293939
{888202515573088257: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 873697596434513921: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 872668790621863937: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 872261713294495745: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 869988702071779329: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 866816280283807744: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 861769973181624320: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 845459076796616705: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 842892208864923648: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 837012587749474308: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 827228250799742977: TweepError([{'code': 144, 'mess




In [14]:
print("1862.885395293939 seconds = ",1862.885395293939/60, "minutes")

#{888202515573088257: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 873697596434513921: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 872668790621863937: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 872261713294495745: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 869988702071779329: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 866816280283807744: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 861769973181624320: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 845459076796616705: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 842892208864923648: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 837012587749474308: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 827228250799742977: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 812747805718642688: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 802247111496568832: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 775096608509886464: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 770743923962707968: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 754011816964026368: TweepError([{'code': 144, 'message': 'No status found with that ID.'}]), 680055455951884288: TweepError([{'code': 144, 'message': 'No status found with that ID.'}])}



!ls


1862.885395293939 seconds =  31.048089921565648 minutes
config.ini	       tweet-json.txt		     wrangle_act.ipynb
image-predictions.tsv  twitter-api.py
tweet_json-test.txt    twitter-archive-enhanced.csv


### Load twitter information
TODO: Read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.

Let's see what we have in the file:

In [15]:
!file tweet_json-test.txt 
!head -c200 tweet_json-test.txt 
!tail -c200 tweet_json-test.txt 

tweet_json-test.txt: HTML document, ASCII text, with very long lines
{"created_at": "Tue Aug 01 16:23:56 +0000 2017", "id": 892420643555336193, "id_str": "892420643555336193", "full_text": "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut.ull, "is_quote_status": false, "retweet_count": 496, "favorite_count": 2516, "favorited": false, "retweeted": false, "possibly_sensitive": false, "possibly_sensitive_appealable": false, "lang": "en"}


It's JSON data, but not a single JSON document: each line is a complete JSON object of its own, one per tweet.
So we read the file line by line, parse each line as JSON, and add the result to a pandas DataFrame.


In [32]:
import json

import pandas as pd

tweets_filename = "tweet_json-test.txt"

# Collect one small dictionary per tweet
tweets = []
with open(tweets_filename, 'r') as infile:
    for line in tqdm(infile):
        data = json.loads(line)
        # Keep only the fields we need from the tweet's JSON data
        tweets.append(dict(tweet_id=data["id"],
                           retweet_count=data["retweet_count"],
                           favorite_count=data["favorite_count"]))
# Load the list of dictionaries into a DataFrame
df_tweets = pd.DataFrame(tweets)
# Delete the list to save memory
del tweets
df_tweets.head(3)
df_tweets.shape

# pandas has a method "read_json" that could do this for us, with the parameter "lines=True".
# It works with tweet-json.txt but not with tweet_json-test.txt
# df_tweets = pd.read_json(tweets_filename, lines=True)
# df_tweets.head(3)
# df_tweets.shape

2339it [00:00, 20321.10it/s]


Unnamed: 0,favorite_count,retweet_count,tweet_id
0,37667,8213,892420643555336193
1,32353,6072,892177421306343426
2,24372,4015,891815181378084864


(2339, 3)

It works. We now have 2,339 rows with 3 columns.
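The line-by-line read above stops at the first line that is not valid JSON. As a minimal sketch, assuming we would rather skip blank or truncated lines and report where they were, the parsing step could be wrapped as follows. The function name `parse_tweet_lines` and the sample lines are hypothetical; they are not part of the project files.

```python
import json


def parse_tweet_lines(lines):
    """Return (records, bad_line_numbers) from an iterable of JSON Lines text."""
    records, bad_lines = [], []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline at the end of the file
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            bad_lines.append(line_number)  # remember where parsing failed
            continue
        records.append({"tweet_id": data["id"],
                        "retweet_count": data["retweet_count"],
                        "favorite_count": data["favorite_count"]})
    return records, bad_lines


# Made-up sample: two valid tweets, one blank line, one truncated line
sample = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n',
    '\n',
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n',
    '{"id": 3, "retweet_count": 4\n',
]
records, bad_lines = parse_tweet_lines(sample)
print(records)
print(bad_lines)
```

The returned `records` list can be passed to `pd.DataFrame` exactly like the `tweets` list, and an open file handle can be passed in place of `sample`.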

## Assessing Data for this Project

After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least:
- eight (8) quality issues
- two (2) tidiness issues

To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.

In [38]:
df_twitter_archive_enhanced.head(3)
df_tweets.head(3)
df_twitter_all = pd.merge(df_twitter_archive_enhanced, df_tweets, on='tweet_id')
df_twitter_all.head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,


Unnamed: 0,favorite_count,retweet_count,tweet_id
0,37667,8213,892420643555336193
1,32353,6072,892177421306343426
2,24372,4015,891815181378084864


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,favorite_count,retweet_count
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,37667,8213
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,32353,6072
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,24372,4015
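`pd.merge` performs an inner join by default, so archive tweets that have no API data (the archive has 2356 rows, `df_tweets` has 2339) are dropped without notice. A possible way to see which tweets are affected is a left join with `indicator=True`. This is only a sketch on made-up toy frames; `archive`, `api_counts` and `missing_ids` are illustrative names.

```python
import pandas as pd

# Toy stand-ins for df_twitter_archive_enhanced and df_tweets (made-up values)
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})
api_counts = pd.DataFrame({"tweet_id": [1, 3], "retweet_count": [8, 4]})

# A left join keeps every archive row; the _merge column records where each row came from
checked = pd.merge(archive, api_counts, on="tweet_id", how="left", indicator=True)
missing_ids = checked.loc[checked["_merge"] == "left_only", "tweet_id"].tolist()
print(missing_ids)
```

On the real data, `missing_ids` would be expected to list the tweets for which the API call failed.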


In [47]:
columns = df_twitter_all.columns
to_be_removed = []
for column in columns:
    # Index the column by name instead of building an expression for eval()
    nb_unique_values = df_twitter_all[column].nunique()
    print("Distinct values in", column, ":", nb_unique_values)
    # Columns with more than 50 distinct values are too varied to list in full
    if nb_unique_values > 50:
        to_be_removed.append(column)
to_be_removed

Distinct values in tweet_id : 2339
Distinct values in in_reply_to_status_id : 77
Distinct values in in_reply_to_user_id : 31
Distinct values in timestamp : 2339
Distinct values in source : 4
Distinct values in text : 2339
Distinct values in retweeted_status_id : 167
Distinct values in retweeted_status_user_id : 23
Distinct values in retweeted_status_timestamp : 167
Distinct values in expanded_urls : 2208
Distinct values in rating_numerator : 40
Distinct values in rating_denominator : 18
Distinct values in name : 956
Distinct values in doggo : 2
Distinct values in floofer : 2
Distinct values in pupper : 2
Distinct values in puppo : 2
Distinct values in favorite_count : 1997
Distinct values in retweet_count : 1709


['tweet_id',
 'in_reply_to_status_id',
 'timestamp',
 'text',
 'retweeted_status_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'name',
 'favorite_count',
 'retweet_count']
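The same selection can probably be written without a loop, because `DataFrame.nunique()` returns one distinct-value count per column. The sketch below uses a made-up toy frame and a hypothetical threshold of 2 (the notebook uses 50).

```python
import pandas as pd

# Toy frame standing in for df_twitter_all (made-up values)
df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "source": ["iphone", "iphone", "web", "iphone"],
    "doggo": ["None", "doggo", "None", "None"],
})

max_distinct = 2  # hypothetical threshold; the notebook uses 50
distinct_counts = df.nunique()  # one count per column
to_be_removed = distinct_counts[distinct_counts > max_distinct].index.tolist()
print(distinct_counts)
print(to_be_removed)
```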

In [46]:
columns_redux = [item for item in columns if item not in to_be_removed]
for column in columns_redux:
    print("Distinct values in", column)
    # Index the column by name instead of building an expression for eval()
    print(df_twitter_all[column].value_counts())

Distinct values in in_reply_to_user_id


4.196984e+09    47
2.195506e+07     2
7.305050e+17     1
2.916630e+07     1
3.105441e+09     1
2.918590e+08     1
2.792810e+08     1
2.319108e+09     1
1.806710e+08     1
3.058208e+07     1
2.625958e+07     1
1.943518e+08     1
3.589728e+08     1
8.405479e+17     1
2.894131e+09     1
2.143566e+07     1
2.281182e+09     1
1.648776e+07     1
4.717297e+09     1
2.878549e+07     1
1.582854e+09     1
4.670367e+08     1
4.738443e+07     1
1.361572e+07     1
1.584641e+07     1
2.068372e+07     1
1.637468e+07     1
1.185634e+07     1
1.198989e+09     1
1.132119e+08     1
7.759620e+07     1
Name: in_reply_to_user_id, dtype: int64

Distinct values in source


<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2205
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      10
Name: source, dtype: int64

Distinct values in retweeted_status_user_id


4.196984e+09    144
4.296832e+09      2
5.870972e+07      1
6.669901e+07      1
4.119842e+07      1
7.475543e+17      1
7.832140e+05      1
4.871977e+08      1
5.970642e+08      1
4.466750e+07      1
1.228326e+09      1
7.992370e+07      1
2.488557e+07      1
7.874618e+17      1
3.638908e+08      1
5.128045e+08      1
8.117408e+08      1
1.732729e+09      1
1.960740e+07      1
3.410211e+08      1
7.124572e+17      1
2.804798e+08      1
1.950368e+08      1
Name: retweeted_status_user_id, dtype: int64

Distinct values in rating_numerator


12      554
11      464
10      459
13      344
9       157
8       102
7        55
14       51
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

Distinct values in rating_denominator


10     2317
50        3
80        2
11        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

Distinct values in doggo


None     2243
doggo      96
Name: doggo, dtype: int64

Distinct values in floofer


None       2329
floofer      10
Name: floofer, dtype: int64

Distinct values in pupper


None      2083
pupper     256
Name: pupper, dtype: int64

Distinct values in puppo


None     2309
puppo      30
Name: puppo, dtype: int64

In [112]:
df_poor_notation = df_twitter_all.loc[df_twitter_all['rating_numerator'] > 420]
df_poor_notation = df_poor_notation[['tweet_id', 'rating_numerator', 'text']]

# Thanks to https://stackoverflow.com/questions/46350705/creating-new-column-in-pandas-dataframe-using-regex
import re

# Matches ratings written as "numerator/denominator", e.g. "13/10"
rating_pattern = re.compile(r'\d+/\d+')
extracted_notation = []
for tweet in df_poor_notation['text']:
    print("\ntweet :", tweet)
    notation = ""
    for one_notation in rating_pattern.finditer(tweet):
        print(one_notation.group())
        # Show up to 50 characters before the rating and 15 from its start
        pos = one_notation.start()
        start = max(pos - 50, 0)  # a negative start would wrap around and give an empty slice
        end = pos + 15
        notation = tweet[start:end]
        print("notation:", notation)
    extracted_notation.append(notation)  # context of the last rating found, or "" if none


# df_poor_notation['extract'] = extracted_notation
# df_poor_notation = df_poor_notation[['tweet_id', 'rating_numerator', 'extract']]
# df_poor_notation


tweet : @s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10
666/10
notation: e ultimately failed. His goodness shines through. 666/10

tweet : @jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho
960/00
notation: un @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a 
13/10
notation: w you're excited but 960/00 isn't a valid rating, 13/10 is tho

tweet : This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh
1776/10
notation: This is Atticus. He's quite simply America af. 1776/10 https:/


In [130]:
import re
# RegExp from http://www.noah.org/wiki/RegEx_Python#URL_regex_pattern
pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')


def remove_url(text):
    # Remove every URL from the tweet text; text without a URL is returned unchanged
    return pattern.sub('', text)


df_twitter_all['text_clean'] = df_twitter_all['text'].map(remove_url)


# Check: report any tweet that still contains a URL after cleaning
for tweet in df_twitter_all['text_clean']:
    leftover = pattern.search(tweet)
    if leftover:
        print(leftover.group())

In [113]:
df_twitter_tmp = df_twitter_all.query("rating_numerator > 10 and rating_numerator <= 15 and rating_denominator == 10")
df_twitter_tmp.rating_numerator.value_counts()
df_twitter_tmp.rating_denominator.value_counts()

12    554
11    463
13    344
14     51
15      2
Name: rating_numerator, dtype: int64

10    1414
Name: rating_denominator, dtype: int64

In [33]:
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2339 entries, 0 to 2338
Data columns (total 3 columns):
favorite_count    2339 non-null int64
retweet_count     2339 non-null int64
tweet_id          2339 non-null int64
dtypes: int64(3)
memory usage: 54.9 KB


In [34]:
df_tweets.describe()

Unnamed: 0,favorite_count,retweet_count,tweet_id
count,2339.0,2339.0,2339.0
mean,7877.033348,2893.996152,7.42162e+17
std,12207.163623,4880.927842,6.828729e+16
min,0.0,1.0,6.660209e+17
25%,1363.5,581.5,6.783378e+17
50%,3423.0,1351.0,7.186133e+17
75%,9645.5,3371.5,7.986907e+17
max,162630.0,82747.0,8.924206e+17


In [35]:
df_tweets.lang.value_counts()
df_tweets.lang.nunique()
df_tweets.in_reply_to_screen_name.nunique()

AttributeError: 'DataFrame' object has no attribute 'lang'

In [None]:
### Read 

In [None]:
df_pred.info()

Check: every row has a value in every column,
and the column types were correctly identified.

In [None]:
df_pred.describe()

As expected, the mean confidence decreases from the first prediction to the third.
The min and max are well within 0 and 1.

In [None]:
df_pred.p1.value_counts()

Some entries have nothing to do with dogs: minibus, fountain, desktop_computer, ...
But we can identify them with the column "p1_dog".

In [None]:
df_pred.p1_dog.value_counts()

So, of 2,075 pictures, 543 are identified as not being a dog.

In [None]:
df_pred.query("p1_dog == True")["p1"].value_counts()
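When the first prediction is not a dog, the second or third one may still be. A minimal sketch, assuming p1 to p3 are ordered by confidence (as the decreasing means suggest), picks the highest-ranked prediction flagged as a dog. The helper `first_dog_breed` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows shaped like image-predictions.tsv (made-up values)
df = pd.DataFrame({
    "p1": ["redbone", "minibus", "fountain"],
    "p1_dog": [True, False, False],
    "p2": ["collie", "malinois", "desk"],
    "p2_dog": [True, True, False],
    "p3": ["bloodhound", "pug", "lamp"],
    "p3_dog": [True, True, False],
})


def first_dog_breed(row):
    """Return the highest-ranked prediction flagged as a dog, or None."""
    for rank in ("p1", "p2", "p3"):
        if row[rank + "_dog"]:
            return row[rank]
    return None


breeds = [first_dog_breed(row) for _, row in df.iterrows()]
print(breeds)
```

Rows where none of the three predictions is a dog get `None`, which makes them easy to filter out later.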


    eight (8) quality issues
    two (2) tidiness issues

--------------

expanded_urls may contain up to 4 comma-separated URLs pointing to the images.

in_reply_to_user_id is stored as float64 rather than as an integer ID. Is that a problem?

source does not seem to be of any use: it holds an HTML anchor tag rather than the values we expected.

rating_denominator contains some values other than 10. But looking at the tweets concerned, these are "real" ratings.

rating_numerator has values from 0 to 1776. Only a few of them are extraction mistakes, like this one where a 9.5/10 became a 5/10:
"I've been told there's a slight possibility he's checking his mirror. We'll bump to 9.5/10. Still a menace"
Or a misreading when a tweet contains more than one rating:
"@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho"
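Both kinds of mistake could be handled by re-extracting the ratings from the text. The sketch below is one possible approach, not the method used to build the archive: the pattern accepts a decimal numerator, and `pick_rating` assumes that when several ratings appear, the last one with a non-zero denominator is the intended one. Both helper names are hypothetical.

```python
import re

# Optional decimal part in the numerator, so "9.5/10" is not read as "5/10"
RATING_PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')


def extract_ratings(text):
    """Return every (numerator, denominator) pair found in the text, in order."""
    return [(float(num), int(den)) for num, den in RATING_PATTERN.findall(text)]


def pick_rating(text):
    """Return the last rating with a non-zero denominator, or None (a heuristic)."""
    valid = [rating for rating in extract_ratings(text) if rating[1] != 0]
    return valid[-1] if valid else None


decimal_tweet = "We'll bump to 9.5/10. Still a menace"
double_tweet = "I know you're excited but 960/00 isn't a valid rating, 13/10 is tho"
print(pick_rating(decimal_tweet))
print(extract_ratings(double_tweet))
print(pick_rating(double_tweet))
```

The "last valid rating" rule is only a guess at the author's intent; tweets where it picks the wrong pair would still need a manual check.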

@dog_rates sometimes goes beyond its classic ratings of 11 to 15:
"His goodness shines through. 666/10"
"He's quite simply America af. 1776/10"

So I've manually corrected the few real errors:

But if we want to do calculations on the ratings, it seems fair to remove the joke outliers and keep only the common ratings from 11 to 15. This can be done simply with:
df_twitter_tmp = df_twitter_all.query("rating_numerator > 10 and rating_numerator <= 15 and rating_denominator == 10")


Some tweets include their short URL in the text, so let's remove them:



----------------


The ratings probably aren't all correct.
Same goes for the dog names
and probably dog stages




You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
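One way to apply this requirement is sketched below on made-up toy frames. It assumes that a retweet is a row with a value in retweeted_status_id, and that "has images" can be approximated by "has a row in the image predictions table"; both are assumptions, and the variable names are illustrative.

```python
import pandas as pd

# Toy stand-ins for the archive and the image predictions (made-up values)
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [None, 99.0, None, None],
})
predictions = pd.DataFrame({"tweet_id": [1, 2, 4], "jpg_url": ["a.jpg", "b.jpg", "d.jpg"]})

# A retweet has a value in retweeted_status_id; an original tweet has NaN there
originals = archive[archive["retweeted_status_id"].isna()]
# An inner join keeps only tweets that also appear in the predictions table
originals_with_images = originals.merge(predictions[["tweet_id"]], on="tweet_id", how="inner")
print(originals_with_images["tweet_id"].tolist())
```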

Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
Cleaning includes merging individual pieces of data according to the rules of tidy data (https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) .
The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
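A typical tidiness issue in this archive is that the dog stage is spread over four columns (doggo, floofer, pupper, puppo), each holding either the stage name or the string "None". A minimal sketch of collapsing them into a single column follows; the column name `stage`, the helper `combine_stages`, and the choice to join multiple stages with a comma are all assumptions.

```python
import pandas as pd

STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]

# Toy rows shaped like the archive (made-up values)
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["None", "doggo", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})


def combine_stages(values):
    """Join the stage names present in one row; return None when there are none."""
    found = [value for value in values if value != "None"]
    return ", ".join(found) if found else None


df["stage"] = [combine_stages(values) for values in df[STAGE_COLUMNS].values.tolist()]
df = df.drop(columns=STAGE_COLUMNS)
print(df)
```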


## Storing, Analyzing, and Visualizing Data for this Project

Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).

Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.

In [None]:
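# to_csv is a DataFrame method, not a pandas function; df_twitter_all is the merged table built above
df_twitter_all.to_csv('twitter_archive_master.csv', index=False)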

In [None]:
# Store in SQLite
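A minimal sketch of the SQLite step, assuming the standard-library `sqlite3` module and `DataFrame.to_sql`. The table name `twitter_archive_master` mirrors the CSV file name, and the toy frame `df_master` is made up; an in-memory database is used here so the example leaves no file behind.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the cleaned master table (made-up values)
df_master = pd.DataFrame({"tweet_id": [1, 2], "retweet_count": [8, 4], "favorite_count": [25, 7]})

# ":memory:" keeps the example self-contained; a file path would persist the data
connection = sqlite3.connect(":memory:")
df_master.to_sql("twitter_archive_master", connection, index=False, if_exists="replace")

# Read the table back to confirm the round trip
df_check = pd.read_sql("SELECT * FROM twitter_archive_master", connection)
connection.close()
print(df_check)
```

For the submission, the connection would point at a database file instead, for example a hypothetical `twitter_archive_master.db`.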


## Reporting for this Project

Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

Both of these documents can be created in separate Jupyter Notebooks using the Markdown functionality of Jupyter Notebooks, then downloading those notebooks as PDF files or HTML files (see image below). You might prefer to use a word processor like Google Docs or Microsoft Word, however.