# Gather
## Libraries Imported
- pandas
- numpy
- requests
- os
- datetime
- config  
This module holds the API keys; it is listed in the .gitignore file so the keys stay out of the repo.

## Data
- WeRateDogs Tweets  
The data is loaded with the read_csv function from the pandas library, since the CSV file is ready to use.

- Image Predictions  
The file is downloaded from this [url](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv) using the requests library and saved to the repo.
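
The download step could look roughly like the sketch below. The helper name `download_file` and the `folder` argument are assumptions made for illustration; only the URL comes from this README.

```python
import os
import requests

URL = ("https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
       "599fd2ad_image-predictions/image-predictions.tsv")

def download_file(url, folder="."):
    """Download `url` into `folder`; skip the request if the file already exists."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, url.split("/")[-1])
    if not os.path.exists(path):
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        with open(path, "wb") as f:
            f.write(response.content)
    return path
```

Since the file is tab-separated, it can then be loaded with `pd.read_csv(path, sep="\t")`.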

- Retweet and Favourite Counts  
The retweet and favourite counts of the tweets have to be collected separately.  
The API keys are kept in a separate script that is added to the .gitignore file, and the Tweepy library is used to authorize the requests.  
A for loop iterates over the tweet ids from the tweets data loaded earlier; in its body, the get_status function is used to extract the retweet_count and favorite_count along with the tweet_id.  
The results are stored in a new dataframe called ret_fav_count, which is then saved to the repo so that these steps do not have to be repeated; in future runs the data is imported with read_csv instead.
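
A minimal sketch of that loop is shown below. The function name `collect_counts` and the list of failed ids are assumptions, not the notebook's actual code; `api` is assumed to be an authenticated `tweepy.API` object built from the keys in `config`.

```python
import pandas as pd

def collect_counts(api, tweet_ids):
    """Return (DataFrame of counts, list of ids that could not be fetched).

    `api` is assumed to be an authenticated tweepy.API instance; any object
    exposing get_status(tweet_id) works the same way.
    """
    rows, failed = [], []
    for tweet_id in tweet_ids:
        try:
            status = api.get_status(tweet_id)
        except Exception:
            # Tweepy raises its own error class for deleted or protected
            # tweets; a broad catch keeps this sketch version-independent.
            failed.append(tweet_id)
            continue
        rows.append({
            "tweet_id": tweet_id,
            "retweet_count": status.retweet_count,
            "favorite_count": status.favorite_count,
        })
    columns = ["tweet_id", "retweet_count", "favorite_count"]
    return pd.DataFrame(rows, columns=columns), failed

# Usage sketch (the CSV file name is hypothetical):
# ret_fav_count, failed = collect_counts(api, tweets["tweet_id"])
# ret_fav_count.to_csv("ret_fav_count.csv", index=False)
```

Keeping the failed ids separate makes it easy to see how many tweets are no longer available.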

# Assess
## Quick Look
- info() is used on the three tables to get an idea of what they contain
- duplicated().sum() is used to count duplicate rows
- isnull() and notnull() are used to check for NaN values
- sort_values() is used to check the quality of quantitative values like rating_numerator
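
The checks above can be bundled into one small helper, sketched below. The name `quick_look` and the toy values are made up for illustration and are not taken from the real tables.

```python
import pandas as pd

def quick_look(df, sort_col=None):
    """Summarise duplicates and missing values; optionally list the largest values."""
    df.info()  # dtypes and non-null counts, printed to stdout
    summary = {
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_per_column": df.isnull().sum().to_dict(),
    }
    if sort_col is not None:
        # Largest values first makes implausible entries easy to spot
        summary["largest"] = df[sort_col].sort_values(ascending=False).head().tolist()
    return summary

# Toy frame standing in for the tweets table (values are made up)
sample = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "rating_numerator": [12, 1776, 1776, 13],
    "name": ["Rex", None, None, "a"],
})
report = quick_look(sample, sort_col="rating_numerator")
```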

## Notes
### Quality
tweets table
1. in_reply_to_status_user_id and in_reply_to_user_id columns are floats, so the ids are shown in scientific notation (e+17) with a decimal point
2. timestamp column has object dtype instead of datetime
3. retweets act as duplicates
4. the numerator at index 55 is recorded as 17, but should be 13
5. rating_denominator is not always 10; the rating score should be a single column
6. rating score has outliers (some of them caused by decimals in the numerators)
7. expanded_urls column has repeated values if the tweet has more than one photo
8. name column has 'a' and 'the' values that should be replaced with 'None'

images table
1. some rows have False in all three p#_dog columns

ret_fav_count table
1. 139 tweets have a favourite count of 0 because they are retweets

### Tidiness
tweets table
1. doggo, floofer, pupper, and puppo are 4 columns that should be combined into a single column named 'stage'

images table
1. one breed column should be created  
2. breed column should belong to tweets table

ret_fav_count table
1. the table should be merged into the tweets table

# Clean
## Quality
### tweets table
- change in_reply_to_status_user_id and in_reply_to_user_id columns from float to string
- remove "e+17" and the dot
- replace NaN values with "N/A"  

- change timestamp column from object to datetime  

- remove retweets as they act as duplicates
- drop retweeted_status_user_id and retweeted_status_id columns  

- change the numerator at index 55 from 17 to 13  

- create a rating_score column by dividing rating_numerator by rating_denominator
- drop rating_numerator and rating_denominator columns  

- remove outliers in rating_score column (some of them had decimals in the numerators)
    1. since WeRateDogs mostly rates dogs 10/10, values below 1 are replaced with 1
    2. remove values greater than 1.5  

- keep only one value per tweet in the expanded_urls column  

- replace 'a' or 'the' with 'None' in name column  
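
Several of the steps above are sketched below on a toy table. The function names, the `id_cols` parameter, and the sample values are assumptions for illustration; the index 55 fix and the expanded_urls step are left out.

```python
import pandas as pd

def float_id_to_str(series):
    """Render float ids as digit strings; NaN becomes "N/A"."""
    return series.apply(lambda v: "N/A" if pd.isna(v) else str(int(v)))

def clean_tweets(tweets, id_cols=("in_reply_to_user_id",)):
    """Apply a subset of the quality fixes to a copy of the tweets table."""
    df = tweets.copy()
    # Retweets have a non-null retweeted_status_id; keep original tweets only
    df = df[df["retweeted_status_id"].isnull()]
    df = df.drop(columns=["retweeted_status_id", "retweeted_status_user_id"])
    for col in id_cols:
        df[col] = float_id_to_str(df[col])
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    # A single rating column replaces numerator and denominator
    df["rating_score"] = df["rating_numerator"] / df["rating_denominator"]
    df = df.drop(columns=["rating_numerator", "rating_denominator"])
    # Outliers: drop scores above 1.5, then raise scores below 1 to 1
    df = df[df["rating_score"] <= 1.5].copy()
    df["rating_score"] = df["rating_score"].clip(lower=1)
    df["name"] = df["name"].replace(["a", "the"], "None")
    return df.reset_index(drop=True)

nan = float("nan")
sample = pd.DataFrame({  # made-up values, not real tweets
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_user_id": [nan, 123456789.0, nan, nan],
    "retweeted_status_id": [nan, nan, 987654321.0, nan],
    "retweeted_status_user_id": [nan, nan, 123456789.0, nan],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "rating_numerator": [13, 5, 12, 1776],
    "rating_denominator": [10, 10, 10, 10],
    "name": ["Rex", "a", "Bella", "the"],
})
cleaned = clean_tweets(sample)
```

Float ids above 2**53 have already lost their lowest digits, so if exact ids matter it may be safer to read these columns as strings in read_csv.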

### images table
- remove rows with False in all three p#_dog columns  
The neural network gives three predictions, each with its own probability, for what is in the picture fed to it. If a prediction is a dog breed, the corresponding p#_dog column is True; otherwise it is False. A row with False in all three p#_dog columns therefore contains no dog breed prediction and is removed.
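
As a sketch of this rule, assuming the flag columns are named p1_dog, p2_dog, and p3_dog (the helper name and the toy rows are made up):

```python
import pandas as pd

def drop_non_dog_rows(images):
    """Keep rows where at least one of the three predictions is a dog breed."""
    is_dog = images[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)
    return images[is_dog].reset_index(drop=True)

sample_images = pd.DataFrame({  # made-up rows
    "tweet_id": [1, 2, 3],
    "p1_dog": [True, False, False],
    "p2_dog": [True, True, False],
    "p3_dog": [False, False, False],
})
dogs_only = drop_non_dog_rows(sample_images)
```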

### retweet and favourite counts table
- remove records with favourite_count equal to zero  
Investigation showed that these are retweets, so they should be dropped anyway.

## Tidiness
### tweets table
Create a 'stage' column from the 4 columns doggo, floofer, pupper, and puppo:
1. specify columns that will not change
2. use melt function with variable name 'dog' and value 'stage'
3. drop dog column
4. drop duplicates
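
The melt steps can be sketched as follows. Two things here are assumptions rather than part of the listed steps: missing stages are taken to be the string "None", and rows are sorted so that a real stage is kept ahead of "None" when duplicates are dropped. A tweet tagged with two stages keeps only the first one found.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]

def melt_stages(tweets):
    """Collapse the four stage columns into a single 'stage' column."""
    id_cols = [c for c in tweets.columns if c not in STAGES]  # columns that will not change
    long = tweets.melt(id_vars=id_cols, value_vars=STAGES,
                       var_name="dog", value_name="stage")
    long = long.drop(columns="dog")
    # Assumption: sort real stages ahead of "None" so they survive de-duplication
    long["has_stage"] = long["stage"] != "None"
    long = long.sort_values("has_stage", ascending=False, kind="stable")
    long = long.drop_duplicates(subset="tweet_id").drop(columns="has_stage")
    return long.sort_values("tweet_id").reset_index(drop=True)

sample = pd.DataFrame({  # made-up rows
    "tweet_id": [1, 2, 3],
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "None", "None"],
    "puppo": ["None", "None", "None"],
})
staged = melt_stages(sample)
```

Without the sort, dropping exact duplicates alone would leave both a "None" row and a stage row for every tweet that has a stage.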

### images table
- select breed value depending on boolean values in p#_dog  
Since the neural network provides more than one prediction, each with a different probability, the breed with the highest probability is chosen as the breed of the dog in the picture.

- merge the breed column into the tweets table  
A new table called "breed" is created to hold the tweet_id and the breed of the dog in the picture.
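
One way to build the "breed" table is sketched below. It assumes the prediction columns are named p1, p2, p3 with matching p#_dog flags, and that p1 carries the highest probability and p3 the lowest; the function name and the toy rows are made up.

```python
import pandas as pd

def make_breed_table(images):
    """Pick, per tweet, the highest-ranked prediction flagged as a dog breed."""
    # Start from the lowest-ranked prediction and let higher ranks override it
    breed = images["p3"].where(images["p3_dog"])
    breed = images["p2"].where(images["p2_dog"], breed)
    breed = images["p1"].where(images["p1_dog"], breed)
    table = pd.DataFrame({"tweet_id": images["tweet_id"], "breed": breed})
    # Rows with no dog prediction at all have no breed and are dropped
    return table.dropna(subset=["breed"]).reset_index(drop=True)

sample_images = pd.DataFrame({  # made-up rows
    "tweet_id": [1, 2, 3],
    "p1": ["pug", "seat_belt", "web_site"],
    "p1_dog": [True, False, False],
    "p2": ["beagle", "golden_retriever", "monitor"],
    "p2_dog": [True, True, False],
    "p3": ["collie", "tub", "desk"],
    "p3_dog": [True, False, False],
})
breed = make_breed_table(sample_images)
```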

### ret_fav_count table
- merge the ret_fav_count table into the tweets table
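
A minimal sketch of the merges, assuming all three tables share a tweet_id column of the same dtype. The inner join is an assumption: it keeps only tweets present in every table, and a left join would keep all tweets instead.

```python
import pandas as pd

def build_master(tweets, breed, ret_fav_count):
    """Join the breed and count tables onto the tweets table by tweet_id."""
    master = tweets.merge(breed, on="tweet_id", how="inner")
    master = master.merge(ret_fav_count, on="tweet_id", how="inner")
    return master

# Toy tables with made-up values
tweets = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_score": [1.3, 1.0, 1.2]})
breed = pd.DataFrame({"tweet_id": [1, 2], "breed": ["pug", "beagle"]})
ret_fav_count = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [10, 20, 30],
    "favorite_count": [100, 200, 300],
})
master = build_master(tweets, breed, ret_fav_count)

# Storing the result, using the file name given in the Store section:
# master.to_csv("twitter_archive_master.csv", index=False)
```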

# Store
The cleaned data is stored in the repo as 'twitter_archive_master.csv'.