## Wrangling Report

### September 07, 2022

## Data Gathering

1. I downloaded the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv) directly and loaded it into a DataFrame
    - `dog_data = pd.read_csv('twitter_archive_enhanced.csv')`

2. Used the Requests library to download the tweet image predictions, saved the data to image_predictions.tsv, then loaded the file into a DataFrame
 * `image_pred = pd.read_csv('image_predictions.tsv', sep='\t')`

3. Used the Tweepy library to query additional data via the Twitter API (tweet_json.txt), then kept only the ID, favorite count, and retweet count
  * `tweet_count = pd.read_json('tweets.json', lines=True)`
  * `tweet_count = tweet_count[['id','favorite_count','retweet_count']]`
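
The two loading calls for the TSV and line-delimited JSON files can be tried on small in-memory samples. The sketch below is illustrative only: the sample rows are made up, and `io.StringIO` stands in for the files on disk.

```python
import io

import pandas as pd

# Made-up two-row samples standing in for image_predictions.tsv and the
# line-delimited JSON file; the real files have more columns and rows.
tsv_text = "tweet_id\tp1\tp1_conf\n1\tGolden_retriever\t0.9\n2\tpug\t0.8\n"
json_text = (
    '{"id": 1, "favorite_count": 10, "retweet_count": 3, "lang": "en"}\n'
    '{"id": 2, "favorite_count": 25, "retweet_count": 7, "lang": "en"}\n'
)

# Tab-separated data: sep="\t" tells pandas the delimiter is a tab.
image_pred = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# One JSON object per line: lines=True parses each line as one record.
tweet_count = pd.read_json(io.StringIO(json_text), lines=True)

# Keep only the three columns used later in the report.
tweet_count = tweet_count[["id", "favorite_count", "retweet_count"]]
```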

### Checked the three DataFrames

* `dog_data.head()`
> First rows of the archived Twitter data
* `image_pred.head()`
> First rows of the image prediction data
* `tweet_count.head()`
> First rows of the data from the Twitter API

## Assessing Summary

### Quality issues

#### Twitter archive table (dog_data)
1. __tweet_id__ has dtype int64 and should be object
2. __timestamp__ should have dtype datetime64
3. For some dogs, the stage is not specified
4. Many names are missing in the __name__ column, and some entries are not names at all, such as __'the'__, __'a'__ and __'an'__.
5. Only original tweets should be kept; retweets and replies should be removed from the table
6. Some records have more than one dog stage
7. The __rating_numerator__ column has values less than 10 as well as some very large values
8. The __rating_denominator__ column has values other than 10, some of them much higher
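
Several of these issues can also be checked programmatically. The following is a minimal sketch on made-up rows, not the checks actually run for this report. It assumes that missing names and stages are stored as the string `"None"`; the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows that mimic the archive's columns; not real tweets.
dog_data = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "name": ["Charlie", "a", "None", "the"],
    "rating_numerator": [12, 5, 420, 13],
    "rating_denominator": [10, 10, 10, 50],
    "doggo": ["doggo", "None", "None", "None"],
    "floofer": ["None", "None", "None", "None"],
    "pupper": ["pupper", "None", "pupper", "None"],
    "puppo": ["None", "None", "None", "None"],
})

# Entirely lowercase entries are usually ordinary words, not dog names.
lowercase_names = dog_data.loc[dog_data["name"].str.islower(), "name"].tolist()

# Rows whose denominator is anything other than 10.
odd_denominators = dog_data[dog_data["rating_denominator"] != 10]

# Rows where more than one stage column is filled in
# (assumption: an empty stage is the string "None").
stage_cols = ["doggo", "floofer", "pupper", "puppo"]
stage_counts = (dog_data[stage_cols] != "None").sum(axis=1)
multi_stage = dog_data[stage_counts > 1]
```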

#### Image prediction (image_pred)
1. The types of dogs in columns __p1__, __p2__, and __p3__ should be in lowercase.


#### Tweet count table (tweet_count)
1. Rename the __id__ column to __tweet_id__.

### Tidiness issues

1. The tweet count table should be merged with the Twitter archive table.
2. The __source__ column in the Twitter archive table clutters the table.
3. The __doggo__, __floofer__, __pupper__, and __puppo__ columns in the __twitter_archive_enhanced__ table should be collapsed into one column named __stage__.
4. Rows without any retweets should be dropped from the Twitter archive table (dog_data).
5. All three tables will eventually be merged into one.

## Cleaning Data
In this section, I cleaned **all** the issues I documented while assessing. 

Before cleaning, I made a copy of each table:
* `dog_data_clean = dog_data.copy()`
* `image_pred_clean = image_pred.copy()`
* `tweet_count_clean = tweet_count.copy()`

### Issue #1: Some columns are irrelevant to this analysis and should be dropped

#### Define

- Some columns are not needed in my analysis, so I dropped them to make the table cleaner and better organized
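
A minimal sketch of this step is shown below. The report does not list which columns were dropped, so the names in `cols_to_drop` are placeholders chosen for illustration, and the rows are made up.

```python
import pandas as pd

# Made-up rows; the column names mimic the archive table.
dog_data_clean = pd.DataFrame({
    "tweet_id": [1, 2],
    "source": ["web", "iphone"],
    "expanded_urls": ["u1", "u2"],
    "text": ["t1", "t2"],
})

# Hypothetical choice of columns; substitute the ones not needed.
cols_to_drop = ["source", "expanded_urls"]

# drop returns a new DataFrame without the listed columns.
dog_data_clean = dog_data_clean.drop(columns=cols_to_drop)
```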

### Issue #2: Fix some column names

#### Define

- Renamed the 'id' column in the tweet_count table to 'tweet_id' so that all three tables share the same ID column name.
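
As a sketch, the rename can be done with `DataFrame.rename`; the rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the tweet count table.
tweet_count_clean = pd.DataFrame({
    "id": [1, 2],
    "favorite_count": [10, 25],
    "retweet_count": [3, 7],
})

# Map the old column name to the new one; other columns are untouched.
tweet_count_clean = tweet_count_clean.rename(columns={"id": "tweet_id"})
```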

### Issue #3: Fixing Datatypes.

#### Define

- Converted 'timestamp' in the dog_data table to the datetime datatype
- Converted 'favorite_count' and 'retweet_count' in the tweet_count table to integers
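
One way these conversions might look is sketched below. The sample values are made up, and the sketch assumes the counts arrive as floats and the timestamps as strings.

```python
import pandas as pd

# Made-up rows: timestamps as strings, counts as floats.
dog_data_clean = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
})
tweet_count_clean = pd.DataFrame({
    "tweet_id": [1, 2],
    "favorite_count": [10.0, 25.0],
    "retweet_count": [3.0, 7.0],
})

# Parse the timestamp strings into a datetime column.
dog_data_clean["timestamp"] = pd.to_datetime(dog_data_clean["timestamp"])

# Cast both count columns to 64-bit integers in one call.
tweet_count_clean = tweet_count_clean.astype(
    {"favorite_count": "int64", "retweet_count": "int64"}
)
```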

### Issue #4: Make the dog breed names uniform

#### Define

- Converted all values in p1, p2 and p3 to lower case
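
A short sketch of the lowercase conversion, using made-up prediction values:

```python
import pandas as pd

# Made-up predictions with mixed capitalization.
image_pred_clean = pd.DataFrame({
    "p1": ["Golden_retriever", "pug"],
    "p2": ["Labrador_retriever", "Bull_mastiff"],
    "p3": ["kuvasz", "Norwegian_elkhound"],
})

# Lowercase each prediction column with the vectorized string method.
for col in ["p1", "p2", "p3"]:
    image_pred_clean[col] = image_pred_clean[col].str.lower()
```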

### Issue #5: Create a rating column, i.e. rating numerator / rating denominator

#### Define

- I created the rating column by dividing the rating numerator by its denominator, so that each rating is a single comparable value
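
The division can be sketched as a single vectorized operation; the ratings below are made up.

```python
import pandas as pd

# Made-up ratings, including one with a denominator other than 10.
dog_data_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "rating_numerator": [12, 13, 84],
    "rating_denominator": [10, 10, 70],
})

# Element-wise division produces one normalized rating per row.
dog_data_clean["rating"] = (
    dog_data_clean["rating_numerator"] / dog_data_clean["rating_denominator"]
)
```

Dividing by the denominator puts ratings such as 84/70 on the same scale as 12/10.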

### Issue #6: Create stage column and collapse doggo, floofer, pupper, puppo into stage column

#### Define

- Here, I wrote a function that creates the stage column and fills each row with the value from doggo, floofer, pupper, or puppo. I then dropped those four columns, since their values are now recorded in the stage column
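
The function itself is not shown in this report, so the following is only a sketch of one way to write it. It assumes an empty stage is stored as the string `"None"`, and it joins multiple stages with a hyphen, which is an illustrative choice; `collapse_stage` is a hypothetical name.

```python
import pandas as pd

STAGE_COLS = ["doggo", "floofer", "pupper", "puppo"]


def collapse_stage(row):
    """Join the stage columns that are filled in; return None if none are."""
    found = [row[col] for col in STAGE_COLS if row[col] != "None"]
    return "-".join(found) if found else None


# Made-up rows: one stage, no stage, and two stages.
dog_data_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

# Build the single stage column row by row, then drop the originals.
dog_data_clean["stage"] = dog_data_clean.apply(collapse_stage, axis=1)
dog_data_clean = dog_data_clean.drop(columns=STAGE_COLS)
```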

### Issue #7: Some rows in tweet archive table have dog names as a, an, None, the, fillup, such and quite

#### Define

- Some rows have values in the name column that are not real names, such as a, an, None, the, fillup, such and quite, so I dropped those rows
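
A sketch of the filter, using the invalid values listed above on made-up rows (`invalid_names` is a hypothetical variable name):

```python
import pandas as pd

# Made-up rows mixing real names with the invalid values.
dog_data_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "name": ["Charlie", "a", "None", "Bella", "quite"],
})

# Values in the name column that are not real dog names.
invalid_names = ["a", "an", "None", "the", "fillup", "such", "quite"]

# Keep only the rows whose name is not in the invalid list.
dog_data_clean = dog_data_clean[~dog_data_clean["name"].isin(invalid_names)]
```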

### Issue #8: Merge tweet_archive table with the tweet count table

#### Define
 - Here, I merged the tweet_count table into the tweet_archive table (dog_data_cleaned)
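
A minimal sketch of the merge on made-up rows is shown below. The join type is not stated in this report, so the inner join here is an assumption, and `dog_data_merged` is a hypothetical name for the result.

```python
import pandas as pd

# Made-up rows; tweet 3 has no counts and tweet 4 has no archive entry.
dog_data_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "rating": [1.2, 1.3, 1.1],
})
tweet_count_clean = pd.DataFrame({
    "tweet_id": [1, 2, 4],
    "favorite_count": [10, 25, 5],
    "retweet_count": [3, 7, 1],
})

# Inner join (assumed): keep only tweets present in both tables.
dog_data_merged = dog_data_clean.merge(
    tweet_count_clean, on="tweet_id", how="inner"
)
```

With an inner join, tweets missing from either table are dropped; `how="left"` would instead keep every archive row and leave the counts empty where no match exists.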