# Wrangle that Data

<ul>
    <li><a href="#Gather">Gather</a></li>
    <li>
        <a href="#Assess">Assess</a>
        <ul>
            <li><a href="#Tweet-archive">Tweet archive</a></li>
            <li><a href="#Image-predictions">Image predictions</a></li>
            <li><a href="#Extended-tweets">Extended tweets</a></li>
            <li><a href="#Users">Users</a></li>
            <li><a href="#Problems">Problems</a></li>
        </ul>
    </li>
    <li><a href="#Clean">Clean</a></li>
</ul>

In [1]:
import pandas as pd
import numpy as np
import requests
import json
import tweepy
import os.path
import time
from glob import glob

## Gather

In [2]:
# load secrets
# 'secrets.json' is ignored by git
with open('secrets.json') as secrets_file:
    secrets = json.load(secrets_file)

In [3]:
tweets_raw = pd.read_csv('twitter-archive-enhanced.csv')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [4]:
image_predictions_file_name = 'image-predictions.tsv'
image_predictions_source_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

In [5]:
# make sure image predictions are downloaded
if not os.path.exists(image_predictions_file_name):
    response = requests.get(image_predictions_source_url, stream=True)
    # fail early on HTTP errors instead of saving an error page
    response.raise_for_status()
    # write raw bytes so the download does not depend on the response encoding
    with open(image_predictions_file_name, mode='wb') as dest_file:
        for chunk in response.iter_content(chunk_size=8192):
            # filter out keep-alive new chunks
            if chunk:
                dest_file.write(chunk)

In [6]:
image_predictions_raw = pd.read_csv(image_predictions_file_name, sep='\t')

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [7]:
# get twitter API
consumer_key = secrets['twitter']['consumerApiKey']
consumer_secret = secrets['twitter']['consumerSecret']
access_token = secrets['twitter']['accessToken']
access_secret = secrets['twitter']['accessTokenSecret']

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

twitter_api = tweepy.API(
    auth,
    wait_on_rate_limit=True,
    wait_on_rate_limit_notify=True)

In [8]:
status_ids = (tweets_raw['expanded_urls']
    .str.extract(r'^http[s]?://twitter\.com/dog_rates/status/(\d+)', expand=False)
    .dropna()
    .drop_duplicates()
    .astype(str))
status_count = len(status_ids)
len(status_ids)

1967

In [9]:
already_downloaded_statuses = (
    pd.Series(os.listdir('extended-statuses'))
        .str.extract(r'^(\d+)\.json$', expand=False)
        .dropna())
len(already_downloaded_statuses)

1967

In [10]:
statuses_to_download = set(status_ids) - set(already_downloaded_statuses)
len(statuses_to_download)

0

In [11]:
print('Pulling twitter statuses.')
i = 0
for status_id in statuses_to_download:
    try:
        # the client created above is named twitter_api
        status = twitter_api.get_status(status_id, tweet_mode='extended')
        with open(f'extended-statuses/{status_id}.json', 'w') as target_file:
            json.dump(status._json, target_file, indent=2)
    except tweepy.RateLimitError as rle:
        # back off for five minutes, then retry this status once
        print(rle)
        time.sleep(60 * 5)
        status = twitter_api.get_status(status_id, tweet_mode='extended')
        with open(f'extended-statuses/{status_id}.json', 'w') as target_file:
            json.dump(status._json, target_file, indent=2)
    except Exception as e:
        # skip statuses that cannot be fetched (e.g. deleted tweets)
        print(e)
    i += 1
    if i % 100 == 0:
        print(f'Statuses pulled so far: {i}.')
        time.sleep(60)
print('Done.')

Pulling twitter statuses.
Done.


In [12]:
already_downloaded_statuses = (
    pd.Series(os.listdir('extended-statuses'))
        .str.extract(r'^(\d+)\.json$', expand=False)
        .dropna())
len(already_downloaded_statuses)

1967

In [14]:
extended_tweets_arr = []
users_arr = []
for file_path in glob('extended-statuses/*.json'):
    with open(file_path, 'r', encoding='utf-8') as status_file:
        status = json.load(status_file)
    extended_tweets_arr.append({
        'id': status['id'],
        'full_text': status['full_text'],
        'created_at': status['created_at'],
        'source': status['source'],
        'in_reply_to_status_id': status['in_reply_to_status_id'],
        'in_reply_to_user_id': status['in_reply_to_user_id'],
        'retweet_count': status['retweet_count'],
        'favorite_count': status['favorite_count'],
        'user_id': status['user']['id']})
    users_arr.append({
        'id': status['user']['id'],
        'followers_count': status['user']['followers_count'],
        'friends_count': status['user']['friends_count'],
        'listed_count': status['user']['listed_count'],
        'favourites_count': status['user']['favourites_count'],
        'created_at': status['user']['created_at'],
        'statuses_count': status['user']['statuses_count']})
extended_tweets_raw = pd.DataFrame(extended_tweets_arr)
users_raw = pd.DataFrame(users_arr)

## Assess

"I've yet to rate a Venezuelan Hover Wiener. This is such an honor. 14/10 paw-inspiring af (IG: roxy.thedoxy) https://t.co/20VrLAA8ba"

### Tweet archive

In [29]:
tweets_raw

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [17]:
tweets_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [37]:
tweets_raw['name'].value_counts()

None          745
a              55
Charlie        12
Lucy           11
Oliver         11
Cooper         11
Lola           10
Tucker         10
Penny          10
Bo              9
Winston         9
Sadie           8
the             8
Toby            7
Bailey          7
Daisy           7
an              7
Buddy           7
Milo            6
Dave            6
Koda            6
Bella           6
Rusty           6
Jack            6
Jax             6
Leo             6
Oscar           6
Stanley         6
Scout           6
Bentley         5
             ... 
Bowie           1
by              1
Snoopy          1
Brudge          1
Laela           1
Maks            1
Jebberson       1
Monkey          1
Chesterson      1
Nico            1
Sobe            1
Mimosa          1
Mike            1
Winifred        1
Christoper      1
Ambrose         1
Hermione        1
Richie          1
Aldrick         1
Howie           1
Colin           1
Chuq            1
this            1
Traviss         1
Amélie    

In [55]:
not_names = list(
    tweets_raw['name']
    .where(lambda n: n.str.slice(0, 1) == n.str.slice(0, 1).str.lower())
    .dropna()
    .unique())
not_names.append('None')
print(not_names)

['such', 'a', 'quite', 'not', 'one', 'incredibly', 'mad', 'an', 'very', 'just', 'my', 'his', 'actually', 'getting', 'this', 'unacceptable', 'all', 'old', 'infuriating', 'the', 'by', 'officially', 'life', 'light', 'space', 'None']


In [57]:
tweets_raw['doggo'].value_counts()

None     2259
doggo      97
Name: doggo, dtype: int64

In [58]:
tweets_raw['floofer'].value_counts()

None       2346
floofer      10
Name: floofer, dtype: int64

In [59]:
tweets_raw['pupper'].value_counts()

None      2099
pupper     257
Name: pupper, dtype: int64

In [60]:
tweets_raw['puppo'].value_counts()

None     2326
puppo      30
Name: puppo, dtype: int64

In [61]:
tweets_raw['source'].value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [65]:
tweets_raw['tweet_id'].value_counts().head()

749075273010798592    1
741099773336379392    1
798644042770751489    1
825120256414846976    1
769212283578875904    1
Name: tweet_id, dtype: int64

In [66]:
tweets_raw['in_reply_to_status_id'].value_counts().head()

6.671522e+17    2
8.562860e+17    1
8.131273e+17    1
6.754971e+17    1
6.827884e+17    1
Name: in_reply_to_status_id, dtype: int64

### Image predictions

In [67]:
image_predictions_raw

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


In [18]:
image_predictions_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [72]:
image_predictions_raw['tweet_id'].value_counts().head()

685532292383666176    1
826598365270007810    1
692158366030913536    1
714606013974974464    1
715696743237730304    1
Name: tweet_id, dtype: int64

### Extended tweets

In [73]:
extended_tweets_raw

Unnamed: 0,created_at,favorite_count,full_text,id,in_reply_to_status_id,in_reply_to_user_id,retweet_count,source,user_id
0,Wed Dec 09 16:52:27 +0000 2015,1563,Rare submerged pup here. Holds breath for a lo...,674632714662858753,,,614,"<a href=""http://twitter.com/download/iphone"" r...",4196983835
1,Mon Dec 21 03:12:08 +0000 2015,2965,This is Tug. He's not required to wear the con...,678774928607469569,,,1009,"<a href=""http://twitter.com/download/iphone"" r...",4196983835
2,Tue Jul 11 00:00:02 +0000 2017,24293,This is Kevin. He's just so happy. 13/10 what ...,884562892145688576,,,4901,"<a href=""http://twitter.com/download/iphone"" r...",4196983835
3,Sat Nov 28 21:34:09 +0000 2015,1264,*screams for a little bit and then crumples to...,670717338665226240,,,523,"<a href=""http://twitter.com/download/iphone"" r...",4196983835
4,Sun Mar 26 01:38:00 +0000 2017,31014,We usually don't rate polar bears but this one...,845812042753855489,,,9504,"<a href=""http://twitter.com/download/iphone"" r...",4196983835
5,Wed Dec 02 21:06:56 +0000 2015,903,This is Bubba. He's a Titted Peebles Aorta. Ev...,672160042234327040,,,381,"<a href=""http://twitter.com/download/iphone"" r...",4196983835
6,Sat Nov 21 02:07:05 +0000 2015,1962,This is Erik. He's fucken massive. But also ki...,667886921285246976,,,1149,"<a href=""http://twitter.com/download/iphone"" r...",4196983835
7,Sun Dec 13 01:12:15 +0000 2015,2404,This is Pepper. She's not fully comfortable ri...,675845657354215424,,,964,"<a href=""http://twitter.com/download/iphone"" r...",4196983835
8,Fri Nov 20 18:35:10 +0000 2015,240,This is a rare Hungarian Pinot named Jessiga. ...,667773195014021121,,,59,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",4196983835
9,Mon Feb 29 16:47:42 +0000 2016,1692,This is Ralphson. He's very confused. Wonderin...,704347321748819968,,,386,"<a href=""http://twitter.com/download/iphone"" r...",4196983835


In [27]:
extended_tweets_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1967 entries, 0 to 1966
Data columns (total 9 columns):
created_at               1967 non-null object
favorite_count           1967 non-null int64
full_text                1967 non-null object
id                       1967 non-null int64
in_reply_to_status_id    23 non-null float64
in_reply_to_user_id      23 non-null float64
retweet_count            1967 non-null int64
source                   1967 non-null object
user_id                  1967 non-null int64
dtypes: float64(2), int64(4), object(3)
memory usage: 138.4+ KB


In [76]:
extended_tweets_raw['id'].value_counts().head()

685532292383666176    1
743510151680958465    1
805487436403003392    1
672466075045466113    1
685315239903100929    1
Name: id, dtype: int64

In [75]:
extended_tweets_raw['in_reply_to_status_id'].value_counts().head()

6.671522e+17    2
8.558181e+17    1
6.753494e+17    1
6.747934e+17    1
6.747522e+17    1
Name: in_reply_to_status_id, dtype: int64

In [78]:
extended_tweets_raw['source'].value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     1928
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       28
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [79]:
extended_tweets_raw['user_id'].value_counts()

4196983835    1967
Name: user_id, dtype: int64

### Users

In [80]:
users_raw.head()

Unnamed: 0,created_at,favourites_count,followers_count,friends_count,id,listed_count,statuses_count
0,Sun Nov 15 21:41:29 +0000 2015,134004,6889373,8,4196983835,4389,7100
1,Sun Nov 15 21:41:29 +0000 2015,134004,6889502,8,4196983835,4475,7100
2,Sun Nov 15 21:41:29 +0000 2015,134004,6889503,8,4196983835,4477,7100
3,Sun Nov 15 21:41:29 +0000 2015,134004,6889443,8,4196983835,4414,7100
4,Sun Nov 15 21:41:29 +0000 2015,134004,6889431,8,4196983835,4413,7100


In [25]:
users_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1967 entries, 0 to 1966
Data columns (total 7 columns):
created_at          1967 non-null object
favourites_count    1967 non-null int64
followers_count     1967 non-null int64
friends_count       1967 non-null int64
id                  1967 non-null int64
listed_count        1967 non-null int64
statuses_count      1967 non-null int64
dtypes: int64(6), object(1)
memory usage: 107.6+ KB


### Problems

- Tidiness:
    - In the tweet archive, the dog "stages" (`doggo`, `floofer`, `pupper`, `puppo`) are spread across four columns, but they are values of a single variable and belong in one categorical column;
    - the tweet archive, extended tweets and image predictions should be a single dataset;
    - (alternatively, the Twitter-related information in the tweet archive and extended tweets could go into one dataset, and the dog-related information in the tweet archive and image predictions into another, since these could be seen as two separate concerns. In our case I think all of this is part of the same observation, so I decided against splitting the data into two datasets);

- Quality:
    - tweet archive:
        - `retweeted_status_id` and `retweeted_status_user_id` are displayed in scientific notation (they are stored as float64);
        - False dog names (captured in `not_names`);
        - Columns `name`, `doggo`, `floofer`, `pupper`, `puppo` are polluted with the string 'None' instead of `np.nan`;
        - `source` would be fine as just the inner text of the anchor tag;
        - `source` could be categorical;
        - `timestamp` and `retweeted_status_timestamp` are objects, but represent datetimes;
    - extended_tweets:
        - `in_reply_to_status_id` and `in_reply_to_user_id` should be integers;
        - `created_at` should be datetime;
        - `source` has the same issues as in the tweet archive;
        - `user_id` has a single unique value;
    - users:
        - Turns out all tweets came from one user. This dataset is useless;
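
A few of the quality problems listed above could be handled roughly as follows. This is only a minimal sketch on a made-up three-row frame: `toy`, `cleaned` and the shortened `not_names` list are stand-ins for illustration, not the notebook's actual data or cleaning code.

```python
import pandas as pd

# Toy stand-in for a few columns of the tweet archive (hypothetical rows).
toy = pd.DataFrame({
    'name': ['Phineas', 'a', 'None'],
    'source': [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    ],
    'timestamp': [
        '2017-08-01 16:23:56 +0000',
        '2017-08-01 00:17:27 +0000',
        '2017-07-31 00:18:03 +0000',
    ],
})

# Shortened stand-in for the not_names list built during assessment.
not_names = ['a', 'None']

cleaned = toy.copy()

# Replace false names (including the string 'None') with missing values.
cleaned['name'] = cleaned['name'].where(~cleaned['name'].isin(not_names))

# Keep only the inner text of the anchor tag and store it as a category.
cleaned['source'] = (cleaned['source']
    .str.extract(r'>([^<]+)</a>', expand=False)
    .astype('category'))

# Parse the timestamp strings into timezone-aware datetimes.
cleaned['timestamp'] = pd.to_datetime(
    cleaned['timestamp'], format='%Y-%m-%d %H:%M:%S %z', utc=True)

print(cleaned.dtypes)
```

The same `.where(...)` pattern would also apply to the `doggo`, `floofer`, `pupper` and `puppo` columns, where 'None' is the only placeholder to remove.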

## Clean

### Define

### Act
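
One possible way to act on the dog "stages" tidiness problem from the assessment is to collapse the four stage columns into a single categorical column. The sketch below runs on a made-up frame; `toy`, `tidy`, `collapse_stages` and the new `stage` column name are hypothetical, and joining several stages with a comma is an assumption about how a tweet that mentions more than one stage might be handled.

```python
import numpy as np
import pandas as pd

stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy stand-in for the stage columns of the tweet archive (hypothetical rows).
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'doggo': ['doggo', 'None', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper': ['None', 'pupper', 'None', 'pupper'],
    'puppo': ['None', 'None', 'None', 'None'],
})


def collapse_stages(row):
    """Return the stage(s) present in a row, or NaN when there are none."""
    found = [stage for stage in stage_columns if row[stage] == stage]
    return ','.join(found) if found else np.nan


# Drop the four wide columns and keep one categorical 'stage' column instead.
tidy = toy.drop(columns=stage_columns)
tidy['stage'] = toy.apply(collapse_stages, axis=1).astype('category')

print(tidy)
```

Keeping combined values such as `doggo,pupper` avoids silently discarding information; whether such rows should instead be resolved to a single stage is a separate decision.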