## Project: Wrangle and Analyze Data

**Documentation for data wrangling steps: gather, assess, and clean**

### Data Gathering

The data was gathered from three different sources and in three different formats as described below:

1. The WeRateDogs Twitter archive. This file was provided by Udacity, downloaded manually from the `twitter_archive_enhanced.csv` link, and saved as a CSV file.

2. The tweet image predictions. This file was hosted on Udacity's servers, downloaded programmatically with the Requests library from the `image_predictions.tsv` URL, and saved as a TSV file.

3. Retweet and favorite counts for each tweet. This data was queried from Twitter's API using the Tweepy library and the tweet IDs in the `WeRateDogs` Twitter archive. The data was stored in a text file.
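As a rough illustration of how the third source could be loaded, the sketch below parses a text file that holds one tweet's JSON per line. That layout, the field names `id`, `retweet_count`, and `favorite_count`, and the helper `read_tweet_counts` are assumptions for illustration, not details taken from the project; the sample lines are made up.

```python
import io
import json

import pandas as pd


def read_tweet_counts(lines):
    """Build a DataFrame of per-tweet counts from line-delimited JSON.

    Assumes each non-empty line is one tweet's JSON object (hypothetical layout).
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# Made-up sample standing in for the stored text file
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)
tweet_counts = read_tweet_counts(sample)
```

With a real file, an open file handle can be passed in place of the `io.StringIO` sample, since the function only iterates over lines.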

### Data Assessing

The data was assessed visually and programmatically for quality and tidiness issues.

The programmatic assessment was done using the following pandas functions:

1. `df.info()` to get a concise summary of the DataFrame.

2. `df.describe()` to get a statistical summary of the columns with numerical data in the DataFrame.

3. `df.sample()` to get a random sample of the DataFrame.

4. `df['column_name'].value_counts()` to get a count of each unique value in a column.

5. `df.duplicated().sum()` to get a count of duplicate rows in the DataFrame.

6. `df.isnull().sum()` to get a count of null values in each column.

7. `df['column_name'].unique()` to get an array of the unique values in a column.

... and many more.
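The calls listed above can be tried on a small DataFrame. This is only a sketch: the toy data and the column names are made up for illustration and are not the project's actual tables.

```python
import pandas as pd

# Toy data: one duplicated row and one missing name
df = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "name": ["Rex", "a", "a", None],
    "rating_numerator": [12, 13, 13, 0],
})

df.info()                                 # concise summary, printed to stdout
summary = df.describe()                   # statistics for the numeric columns
name_counts = df["name"].value_counts()   # occurrences of each non-null name
n_duplicates = df.duplicated().sum()      # number of fully duplicated rows
null_counts = df.isnull().sum()           # missing values per column
unique_names = df["name"].unique()        # distinct values, missing value included
```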

The results of the assessments were documented and are shown below.

### Quality issues

`twitter_archive_enhance`

1. HTML tags in source column need to be removed

2. Null values in columns `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp` and `expanded_urls`

3. Null values written as 'None' in `name`, `floofer`, `doggo`, `pupper`, `puppo` columns.

4. The `timestamp` datatype should be datetime, not object

5. The dataset contains retweets, which should not be the case according to the instructions

6. The `source` column has only 4 unique values, so its datatype should be categorical instead of object

7. Some names in the `name` column were wrongly written as 'a'

8. Some rating numerators are significantly smaller than their denominators. This is odd and does not follow the usual pattern of the ratings, e.g., 11/10, 14/10.

9. 1 row has `rating_denominator` set to 0, which is not realistic.

10. 2 rows have `rating_numerator` set to 0, which is not realistic.
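A few of the quality issues above (1, 3, 4 and 6) could be addressed along the following lines. This is a minimal sketch on made-up rows, not the project's actual cleaning code; the regular expression assumes each `source` value is a single anchor tag wrapping the client name.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the shape of the archive
archive = pd.DataFrame({
    "source": [
        '<a href="http://example.com/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://example.com/web" rel="nofollow">Twitter Web Client</a>',
    ],
    "timestamp": ["2017-01-01 10:00:00 +0000", "2017-01-02 11:30:00 +0000"],
    "name": ["Rex", "None"],
    "doggo": ["None", "doggo"],
})

clean = archive.copy()

# Issue 1: keep only the text between the opening and closing anchor tags
clean["source"] = clean["source"].str.extract(r">([^<]+)<", expand=False)

# Issue 3: turn the string 'None' into a real missing value
clean[["name", "doggo"]] = clean[["name", "doggo"]].replace("None", np.nan)

# Issue 4: parse the timestamp strings as datetimes
clean["timestamp"] = pd.to_datetime(clean["timestamp"])

# Issue 6: store the few distinct sources as a categorical
clean["source"] = clean["source"].astype("category")
```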

### Tidiness issues

`twitter_archive_enhance`

1. The dog stages variable should form one column instead of four different columns

2. The two ratings columns in `twitter_archive_enhance` should be one column

`tweet_json`

3. `tweet_json.txt` and `twitter_archive_enhance` can be combined to form a table with one observational unit, i.e., tweets and their related statistics and information (no predictions)
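Tidiness issues 1 and 3 could be handled roughly as follows. The sketch uses made-up rows; the helper `combine_stages` and the new column name `dog_stage` are hypothetical, and joining multiple stages with a comma is just one possible choice.

```python
import pandas as pd

# Made-up rows: the archive with four stage columns, and the tweet counts
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
counts = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [10, 3, 0],
    "favorite_count": [25, 7, 1],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join the stage names present in a row; return None when there are none."""
    stages = [value for value in row if value != "None"]
    return ", ".join(stages) if stages else None


tidy = archive.copy()
tidy["dog_stage"] = tidy[stage_cols].apply(combine_stages, axis=1)
tidy = tidy.drop(columns=stage_cols)

# One table per observational unit: tweets plus their counts
tweets = pd.merge(tidy, counts, on="tweet_id", how="inner")
```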

### Data Cleaning

Copies of all three datasets were made before cleaning, and each issue was cleaned in the following order:

1. Define: Define the cleaning steps required to address the issue.

2. Code: Write the code to address the issue.

3. Test: Test the code to ensure the issue was addressed.
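As an illustration of the define, code, test cycle, the sketch below applies it to the retweet issue. The rows are made up, and the variable name `archive_clean` is an assumption rather than the name used in the project.

```python
import numpy as np
import pandas as pd

# Made-up rows: the second one is a retweet
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 99.0, np.nan],
    "text": ["original", "RT @someone: copy", "original"],
})
archive_clean = archive.copy()  # clean a copy, keep the original untouched

# Define: keep only rows that are not retweets, then drop the now-empty column.

# Code
archive_clean = archive_clean[archive_clean["retweeted_status_id"].isnull()]
archive_clean = archive_clean.drop(columns=["retweeted_status_id"])

# Test
assert "retweeted_status_id" not in archive_clean.columns
assert archive_clean["tweet_id"].tolist() == [1, 3]
```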


Methods and functions used in cleaning the data include:

1. `df.drop()` to drop columns and rows

2. `.str.replace()` to replace values in a column

3. `df.astype()` to change the datatype of a column

4. `pd.merge()` to merge two DataFrames

5. `pd.to_datetime()` to convert a column to datetime

6. `.str.extract()` with regular expressions to extract values from a column

... and many more

### Data Storing

The cleaned data were stored as csv files using the `to_csv()` method in pandas:

1. `twitter_archive_master.csv`

2. `twitter_archive_tweet_info.csv`
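A minimal sketch of the storing step is shown below, using a made-up DataFrame and a temporary directory so it can run anywhere. Passing `index=False` is an assumption; it keeps the row index from being written as an extra column.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the cleaned master table
master = pd.DataFrame({"tweet_id": [1, 2], "rating_numerator": [12, 13]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "twitter_archive_master.csv"
    master.to_csv(path, index=False)  # write without the row index
    restored = pd.read_csv(path)      # read back to check the round trip
```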
