# Reporting: wrangle_report

## Data Gathering
In the cell below, I gather **all** three pieces of data for this project and load them into the notebook. **Note:** each dataset is gathered with a different method.

### Download data
      a. twitter_archive_enhanced.csv
      b. image_predictions.tsv
      c. tweet_json.txt
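
The three files need different loaders, as the note above says. The following is a minimal sketch, assuming `tweet_json.txt` stores one JSON object per line with `id`, `retweet_count`, and `favorite_count` keys. The file layout is not shown in this report, so the key names and the `*_demo` variables are hypothetical. In-memory buffers stand in for the real files so the snippet runs on its own.

```python
import io
import json

import pandas as pd

# Hypothetical sample content standing in for the three project files.
archive_csv = io.StringIO("tweet_id,name\n1,Phineas\n2,Tilly\n")
image_tsv = io.StringIO("tweet_id\tp1\n1\torange\n2\tChihuahua\n")
tweet_json_txt = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 20}\n'
    '{"id": 2, "retweet_count": 30, "favorite_count": 40}\n'
)

# Comma-separated file: the read_csv defaults are enough.
archive_demo = pd.read_csv(archive_csv)

# Tab-separated file: same reader, explicit separator.
image_demo = pd.read_csv(image_tsv, sep="\t")

# Line-delimited JSON: parse each line and keep only the fields of interest.
records = []
for line in tweet_json_txt:
    tweet = json.loads(line)
    records.append(
        {
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        }
    )
tweet_json_demo = pd.DataFrame(records)
```

With real files, the buffers would be replaced by the file paths listed above.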

### Quality issues
1. Column tweet_id has the wrong type. Convert it to int.

2. Columns retweet_count and favorite_count should be integers, not floats. Convert them to int.

3. Columns not used in the analysis should be dropped (text, in_reply_to_status_id, in_reply_to_user_id).

4. The timestamp column has the wrong type. Convert it to datetime.

5. Some tweets are retweets rather than original tweets.

6. Column expanded_urls has rows with null values.

7. The source column holds an HTML anchor tag rather than plain text.

8. Some dog names are not real names (lowercase words such as "a" or "the").

### Tidiness issues
9. Columns doggo, floofer, pupper, and puppo should be replaced with a single categorical column named dog_stage.

10. The Twitter API table and the image prediction table should be merged into the Twitter archive table.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

### Issue #1:

#### Define: Column tweet_id has the wrong type. Convert it to int64.

#### Code

In [101]:
df_twitter_archive.tweet_id = df_twitter_archive.tweet_id.astype('int64')

#### Test

In [102]:
type(df_twitter_archive.tweet_id[0])

numpy.int64

### Issue #2:

#### Define: retweet_count and favorite_count should be integers, not floats. Convert to Int

#### Code

In [103]:
df_tweet_json.favorite_count = df_tweet_json.favorite_count.astype(int)
df_tweet_json.retweet_count = df_tweet_json.retweet_count.astype(int)

#### Test

In [104]:
type(df_tweet_json.favorite_count[0])

numpy.int64

In [105]:
type(df_tweet_json.retweet_count[0])

numpy.int64
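
Calling `type(...)` on the first element inspects a single row, while the column's `dtype` covers all of them. The sketch below, on hypothetical counts, also shows that `astype(int)` raises when a column contains missing values. In that case pandas' nullable `Int64` dtype is one option.

```python
import numpy as np
import pandas as pd

# Hypothetical counts with one missing value, as a left merge might produce.
counts = pd.Series([8853.0, np.nan, 4328.0], name="retweet_count")

# The column dtype describes every row, not only the first element.
column_dtype = counts.dtype

# astype(int) cannot represent NaN, so it raises a ValueError.
try:
    counts.astype(int)
    plain_int_worked = True
except (ValueError, TypeError):
    plain_int_worked = False

# The nullable integer dtype keeps the gap as <NA> and the rest as integers.
nullable_counts = counts.astype("Int64")
```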

### Issue #3:

#### Define: Columns not used in the analysis should be dropped (text, in_reply_to_status_id, in_reply_to_user_id)

#### Code

In [106]:
df_twitter_archive = df_twitter_archive.drop(['text', 'in_reply_to_status_id', 'in_reply_to_user_id'], axis=1)

#### Test

In [107]:
df_twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id                      2356 non-null int64
timestamp                     2356 non-null object
source                        2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(2), int64(3), object(9)
memory usage: 257.8+ KB


### Issue #4:

#### Define: The timestamp column has the wrong type. Convert it to datetime.

#### Code

In [108]:
df_twitter_archive.timestamp = df_twitter_archive.timestamp.astype('datetime64[ns]')

#### Test

In [109]:
type(df_twitter_archive.timestamp[0])

pandas._libs.tslibs.timestamps.Timestamp

In [110]:
df_twitter_archive.timestamp[0]

Timestamp('2017-08-01 16:23:56')
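
`pd.to_datetime` is another way to do this conversion, and it copes with an explicit UTC offset in the string. This is a sketch, assuming the raw timestamps look like `2017-08-01 16:23:56 +0000` (the raw format is not shown in this report).

```python
import pandas as pd

# Hypothetical raw values; the "+0000" suffix is an assumption about the file.
raw = pd.Series(["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"])

# Parse as timezone-aware UTC timestamps.
parsed_utc = pd.to_datetime(raw, utc=True)

# Drop the timezone to get naive timestamps that still hold UTC wall time.
parsed_naive = parsed_utc.dt.tz_localize(None)
```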

### Issue #5:

#### Define: Remove retweet rows

#### Code

In [111]:
df_twitter_archive = df_twitter_archive[df_twitter_archive['retweeted_status_id'].isna() |
                                       df_twitter_archive['retweeted_status_user_id'].isna() |
                                       df_twitter_archive['retweeted_status_timestamp'].isna()]                                    

#### Test

In [112]:
# Percentage of null values in columns
df_twitter_archive.isna().sum() * 100 /len(df_twitter_archive)

tweet_id                        0.000000
timestamp                       0.000000
source                          0.000000
retweeted_status_id           100.000000
retweeted_status_user_id      100.000000
retweeted_status_timestamp    100.000000
expanded_urls                   2.666667
rating_numerator                0.000000
rating_denominator              0.000000
name                            0.000000
doggo                           0.000000
floofer                         0.000000
pupper                          0.000000
puppo                           0.000000
dtype: float64

In [113]:
df_twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id                      2175 non-null int64
timestamp                     2175 non-null datetime64[ns]
source                        2175 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2117 non-null object
rating_numerator              2175 non-null int64
rating_denominator            2175 non-null int64
name                          2175 non-null object
doggo                         2175 non-null object
floofer                       2175 non-null object
pupper                        2175 non-null object
puppo                         2175 non-null object
dtypes: datetime64[ns](1), float64(2), int64(3), object(8)
memory usage: 254.9+ KB
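
In the `info()` outputs, the three `retweeted_status_*` columns have the same non-null count before and after the filter, so they appear to be filled or empty together. If that holds, testing one of them is enough. A sketch on hypothetical rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: tweet 2 is a retweet, tweets 1 and 3 are originals.
demo = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "retweeted_status_id": [np.nan, 7.0, np.nan],
    }
)

# Keep only rows with no retweeted_status_id, i.e. original tweets.
originals = demo[demo["retweeted_status_id"].isna()]

# A direct check of the cleaning step: no retweet rows should remain.
remaining_retweets = int(originals["retweeted_status_id"].notna().sum())
```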


### Issue #6:

#### Define: Column expanded_urls has rows with null values. Drop those rows.

#### Code

In [114]:
df_twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id                      2175 non-null int64
timestamp                     2175 non-null datetime64[ns]
source                        2175 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2117 non-null object
rating_numerator              2175 non-null int64
rating_denominator            2175 non-null int64
name                          2175 non-null object
doggo                         2175 non-null object
floofer                       2175 non-null object
pupper                        2175 non-null object
puppo                         2175 non-null object
dtypes: datetime64[ns](1), float64(2), int64(3), object(8)
memory usage: 254.9+ KB


In [115]:
df_twitter_archive = df_twitter_archive[df_twitter_archive['expanded_urls'].notna()]  

#### Test

In [116]:
df_twitter_archive.expanded_urls.isna().sum()

0

### Issue #7:

#### Define: The source column holds an HTML anchor tag, not plain text. Extract the link text.

#### Code

In [117]:
df_twitter_archive.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     1985
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       30
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [118]:
# Extract the link text between the opening and closing <a> tags
df_twitter_archive.source = df_twitter_archive.source.str.extract(r'>([^<]+)<', expand=False)

#### Test

In [119]:
df_twitter_archive.source.value_counts()

Twitter for iPhone     1985
Vine - Make a Scene      91
Twitter Web Client       30
TweetDeck                11
Name: source, dtype: int64
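
To see what the extraction captures, here is a small sketch that runs `str.extract` on two of the raw values listed above. The pattern shown is one possible choice: it captures everything between the first `>` and the next `<`.

```python
import pandas as pd

source = pd.Series(
    [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    ]
)

# A raw string keeps the backslash-free pattern readable and warning-free;
# expand=False returns a Series because there is a single capture group.
clean_source = source.str.extract(r">([^<]+)<", expand=False)
```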

### Issue #8:

#### Define: Drop rows whose dog name is not a real name

#### Code

In [120]:
wrong_name = ['a', 'actually', 'all', 'an', 'by', 'getting', 'his', 'incredibly', 'infuriating',
'just', 'life', 'light', 'mad', 'my', 'not', 'officially', 'old', 'one', 'quite',
'space', 'such', 'the', 'this', 'unacceptable','very']

In [121]:
df_twitter_archive = df_twitter_archive.loc[~df_twitter_archive['name'].isin(wrong_name)]

#### Test

In [122]:
mask = df_twitter_archive.name.str.contains('^[a-z]', regex = True)
df_twitter_archive[mask].name.value_counts().sort_index()

Series([], Name: name, dtype: int64)
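
A hand-written list can miss lowercase words that were not spotted during assessment. One alternative, sketched here on hypothetical names, is to drop every name that starts with a lowercase letter. That is the same pattern the test above checks for.

```python
import pandas as pd

# Hypothetical names; "a" and "quite" stand in for wrongly extracted words.
demo = pd.DataFrame({"name": ["Phineas", "a", "Tilly", "quite", "None"]})

# str.match anchors at the start of the string; na=False leaves missing names unflagged.
starts_lowercase = demo["name"].str.match(r"[a-z]", na=False)

valid_names = demo.loc[~starts_lowercase]
```

This rule assumes no genuine dog name in the data starts with a lowercase letter.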

### Issue #9:

#### Define: Replace the doggo, floofer, pupper, and puppo columns with a single column named dog_stage, then drop the four original columns

#### Code

In [123]:
df_twitter_archive['add_all'] = df_twitter_archive.doggo + df_twitter_archive.floofer + df_twitter_archive.pupper + df_twitter_archive.puppo

In [124]:
df_twitter_archive.add_all.value_counts()

NoneNoneNoneNone        1689
NoneNonepupperNone       211
doggoNoneNoneNone         70
NoneNoneNonepuppo         23
doggoNonepupperNone        9
NoneflooferNoneNone        9
doggoflooferNoneNone       1
doggoNoneNonepuppo         1
Name: add_all, dtype: int64

In [125]:
df_twitter_archive['dog_stage'] = df_twitter_archive['add_all'].astype('str')

In [126]:
df_twitter_archive['dog_stage'] = df_twitter_archive['dog_stage'].str.replace("None", "")

In [127]:
df_twitter_archive = df_twitter_archive.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis=1)

#### Test

In [128]:
df_twitter_archive.dog_stage.value_counts()

                1689
pupper           211
doggo             70
puppo             23
floofer            9
doggopupper        9
doggofloofer       1
doggopuppo         1
Name: dog_stage, dtype: int64
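
The concatenation above glues multi-stage rows together (`doggopupper`) and leaves an empty string for tweets with no stage. A possible refinement, shown on hypothetical rows, joins the stages with a comma and uses a missing value when none is present. It assumes, as in the `value_counts()` output above, that an absent stage is stored as the string `"None"`.

```python
import numpy as np
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Hypothetical rows: no stage, one stage, and two stages.
demo = pd.DataFrame(
    [
        ["None", "None", "None", "None"],
        ["None", "None", "pupper", "None"],
        ["doggo", "None", "pupper", "None"],
    ],
    columns=stage_cols,
)


def combine_stages(row):
    """Join the stage labels present in a row; return NaN when there are none."""
    present = [value for value in row if value != "None"]
    return ", ".join(present) if present else np.nan


demo["dog_stage"] = demo[stage_cols].apply(combine_stages, axis=1)
demo = demo.drop(columns=stage_cols)
```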

### Issue #10:

#### Define: Twitter API table and Image prediction table should be merged into Twitter archive table

#### Code

In [129]:
df_final_data = pd.merge(df_twitter_archive, df_image, on = "tweet_id", how = "left")

In [130]:
df_tweet_json["tweet_id"]  = df_tweet_json["tweet_id"].astype(int)

In [131]:
df_final_data = pd.merge(df_final_data, df_tweet_json, on = "tweet_id", how = "left")

#### Test

In [132]:
df_final_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2013 entries, 0 to 2012
Data columns (total 28 columns):
tweet_id                      2013 non-null int64
timestamp                     2013 non-null datetime64[ns]
source                        2013 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2013 non-null object
rating_numerator              2013 non-null int64
rating_denominator            2013 non-null int64
name                          2013 non-null object
add_all                       2013 non-null object
dog_stage                     2013 non-null object
jpg_url                       1896 non-null object
img_num                       1896 non-null float64
p1                            1896 non-null object
p1_conf                       1896 non-null float64
p1_dog                        1896 non-null object
p2                        
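
A left merge keeps every archive row and fills the image and API columns with missing values where no match exists. That is why `jpg_url` above has 1896 non-null entries against 2013 for `tweet_id`. The sketch below uses hypothetical frames; `validate` and `indicator` are optional `pd.merge` arguments that help check the join.

```python
import pandas as pd

# Hypothetical frames: tweet 2 has no image prediction.
archive_demo = pd.DataFrame({"tweet_id": [1, 2, 3], "name": ["Phineas", "Tilly", "Archie"]})
image_demo = pd.DataFrame({"tweet_id": [1, 3], "p1": ["orange", "Chihuahua"]})

merged = pd.merge(
    archive_demo,
    image_demo,
    on="tweet_id",
    how="left",
    validate="one_to_one",  # raise if tweet_id is duplicated on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Count archive rows that found no partner in the image table.
unmatched = int((merged["_merge"] == "left_only").sum())
```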

## Storing Data
Save the gathered, assessed, and cleaned master dataset to a CSV file. Here the file is named "twitter_archive_final.csv".

In [139]:
df_final_data.to_csv("twitter_archive_final.csv", index=False)
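
Writing to CSV drops dtype information, so `timestamp` comes back as plain strings unless `read_csv` is told to parse it. The following is a sketch of the round trip, with an in-memory buffer standing in for the file and a single hypothetical row:

```python
import io

import pandas as pd

demo = pd.DataFrame(
    {
        "tweet_id": [892420643555336193],
        "timestamp": pd.to_datetime(["2017-08-01 16:23:56"]),
    }
)

buffer = io.StringIO()
demo.to_csv(buffer, index=False)  # index=False avoids writing an extra index column
buffer.seek(0)

# parse_dates restores the datetime dtype; tweet_id is read back as int64.
restored = pd.read_csv(buffer, parse_dates=["timestamp"])
```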

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

In [140]:
df_final_data = pd.read_csv('twitter_archive_final.csv')
df_final_data

| | tweet_id | timestamp | source | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | ... | p2_conf | p2_dog | p3 | p3_conf | p3_dog | favorite_count | followers_count | retweet_count | retweeted | truncated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | 2017-08-01 16:23:56 | Twitter for iPhone | | | | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | ... | 0.085851 | False | banana | 0.076110 | False | 39467 | 3200889 | 8853 | False | False |
| 1 | 892177421306343426 | 2017-08-01 00:17:27 | Twitter for iPhone | | | | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | ... | 0.090647 | True | papillon | 0.068957 | True | 33819 | 3200889 | 6514 | False | False |
| 2 | 891815181378084864 | 2017-07-31 00:18:03 | Twitter for iPhone | | | | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | ... | 0.078253 | True | kelpie | 0.031379 | True | 25461 | 3200889 | 4328 | False | False |
| 3 | 891689557279858688 | 2017-07-30 15:58:51 | Twitter for iPhone | | | | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | ... | 0.168086 | True | spatula | 0.040836 | False | 42908 | 3200889 | 8964 | False | False |
| 4 | 891327558926688256 | 2017-07-29 16:00:24 | Twitter for iPhone | | | | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | ... | 0.225770 | True | German_short-haired_pointer | 0.175219 | True | 41048 | 3200889 | 9774 | False | False |
| 5 | 891087950875897856 | 2017-07-29 00:08:17 | Twitter for iPhone | | | | https://twitter.com/dog_rates/status/891087950... | 13 | 10 | | ... | 0.116317 | True | Indian_elephant | 0.076902 | False | 20562 | 3200889 | 3261 | False | False |
| 6 | 890971913173991426 | 2017-07-28 16:27:12 | Twitter for iPhone | | | | https://gofundme.com/ydvmve-surgery-for-jax,ht... | 13 | 10 | Jax | ... | 0.199287 | True | ice_lolly | 0.193548 | False | 12041 | 3200889 | 2158 | False | False |
| 7 | 890729181411237888 | 2017-07-28 00:22:40 | Twitter for iPhone | | | | https://twitter.com/dog_rates/status/890729181... | 13 | 10 | | ... | 0.178406 | True | Pembroke | 0.076507 | True | 56848 | 3200889 | 16716 | False | False |
| 8 | 890609185150312448 | 2017-07-27 16:25:51 | Twitter for iPhone | | | | https://twitter.com/dog_rates/status/890609185... | 13 | 10 | Zoey | ... | 0.193054 | True | Chesapeake_Bay_retriever | 0.118184 | True | 28226 | 3200889 | 4429 | False | False |
| 9 | 890240255349198849 | 2017-07-26 15:59:51 | Twitter for iPhone | | | | https://twitter.com/dog_rates/status/890240255... | 14 | 10 | Cassie | ... | 0.451038 | True | Chihuahua | 0.029248 | True | 32467 | 3200889 | 7711 | False | False |


### Insights:
1. Year with the most tweets

2. Source with the most tweets

3. Most popular dog name
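
Each of these insights comes down to a `value_counts()` call. Here is a sketch on a few hypothetical rows that use the same column names as `df_final_data`:

```python
import pandas as pd

# Hypothetical rows; the last one has no dog name.
demo = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2015-11-16 00:00:00",
                "2016-03-01 12:00:00",
                "2016-07-04 08:30:00",
                "2017-08-01 16:23:56",
            ]
        ),
        "source": ["Twitter for iPhone", "Twitter for iPhone", "TweetDeck", "Twitter for iPhone"],
        "name": ["Charlie", "Lucy", "Charlie", None],
    }
)

# 1. Year with the most tweets.
tweets_per_year = demo["timestamp"].dt.year.value_counts()
top_year = tweets_per_year.idxmax()

# 2. Source with the most tweets.
top_source = demo["source"].value_counts().idxmax()

# 3. Most popular dog name (value_counts skips missing names by default).
top_name = demo["name"].value_counts().idxmax()
```

On the real data, `timestamp` needs a datetime dtype for the `.dt.year` accessor to work, which matters after reloading from CSV.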

In [1]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'wrangle_report.ipynb'])

0