
## Introduction
This project combines and cleans several data sources so that they are easy to use for visualization, modeling, etc. The data describe tweets about dogs and come from three different sources:
* The WeRateDogs Twitter archive, provided as a csv file called **twitter-archive-enhanced.csv**
* The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. These data are hosted on Udacity's servers and accessed through an HTTP request. The response is a tsv file called **image_predictions.tsv**
* Each tweet's retweet count and favorite ("like") count, accessed through Twitter's API and stored programmatically in a txt file called **tweet_json.txt**.

### Loading and cleaning the locally stored csv called twitter-archive-enhanced.csv
Loading a locally stored file is straightforward. After assessing the data, the issues below were noted and handled. All the cleaning is performed on a copy of the original dataframe called **arch_clean_df**:<br><br>
**tidiness issues:**
* The dog-state information occupies 4 columns, each of which simply repeats its own column name when the state applies. This is handled by replacing the 4 columns with a single one called "dog_state". For tweets with multiple dogs (and therefore multiple states) we provide 2 or more values separated by commas in the corresponding cell
* The "rating_denominator" column is the same for all rows, so it is redundant: a single float column can hold the rating. This is handled by replacing "rating_numerator" and "rating_denominator" with their quotient, but only after re-extracting the numerator from the text so that decimals that were ignored in some cases are preserved. The result is a single column called "rating".
<br>
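
The two tidiness fixes above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the four stage column names (`doggo`, `floofer`, `pupper`, `puppo`), the `"None"` placeholder for "not true", and the rating regex are assumptions made for the example.

```python
import pandas as pd

# Assumed names of the four stage columns. Each cell is assumed to hold the
# column's own name when the state applies and the string "None" otherwise.
STATE_COLS = ["doggo", "floofer", "pupper", "puppo"]


def collapse_dog_states(df: pd.DataFrame) -> pd.DataFrame:
    """Replace the four stage columns with one comma-separated 'dog_state'."""
    out = df.copy()
    out["dog_state"] = out[STATE_COLS].apply(
        lambda row: ",".join(v for v in row if v != "None"), axis=1
    )
    # Tweets with no state get a missing value instead of an empty string.
    out["dog_state"] = out["dog_state"].replace("", pd.NA)
    return out.drop(columns=STATE_COLS)


def add_rating(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Re-extract 'numerator/denominator' from the text and store the quotient."""
    out = df.copy()
    # The optional decimal part keeps ratings such as 13.5/10 intact.
    parts = out[text_col].str.extract(r"(\d+(?:\.\d+)?)/(\d+)").astype(float)
    out["rating"] = parts[0] / parts[1]
    return out.drop(columns=["rating_numerator", "rating_denominator"])
```

The regex only captures the first `x/y` pattern in each text, so tweets containing dates or several ratings would still need a manual check.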

**cleanliness issues:**
* Several columns ('in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp') tell us whether we are dealing with an original tweet or not. Specifically, rows where the first two have values are replies, and rows where the other three have values are retweets. Since we are interested in original tweets only, we removed all the rows where any of the 5 columns above has a non-null value. After that the 5 columns themselves are deleted, since they no longer carry any information.
* "text" is not a very descriptive column name. We changed it to "dog_description"
* The "source" column is the same for all rows and is an HTML tag rather than a raw link. This could be fixed at any time, but since we didn't use that column we left it as it is
* The "tweet_id" column should be of type object (string) and not integer. This is fixed
* The rating numerator and denominator should be floats. They also need to be extracted more carefully from the text so that decimal points are preserved. This is fixed before replacing them with the "rating" column
* The dog names are often wrong: random words appear in place of names. These could be identified programmatically because they don't begin with a capital letter. We could fix this, but since we won't be using the names we simply dropped that column. In any case, names can be extracted from the "dog_description" at any point
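
As a rough sketch of the filtering and renaming steps listed above (an illustration under assumptions, not the notebook's exact code), the helpers below keep only original tweets and flag suspicious names. The function names are hypothetical.

```python
import pandas as pd

REPLY_COLS = ["in_reply_to_status_id", "in_reply_to_user_id"]
RETWEET_COLS = [
    "retweeted_status_id",
    "retweeted_status_user_id",
    "retweeted_status_timestamp",
]


def keep_original_tweets(df: pd.DataFrame) -> pd.DataFrame:
    """Drop replies and retweets, then tidy up column names and types."""
    out = df.copy()
    marker_cols = REPLY_COLS + RETWEET_COLS
    # A tweet is original only if every reply/retweet marker is null.
    is_original = out[marker_cols].isna().all(axis=1)
    out = out[is_original].drop(columns=marker_cols)
    out = out.rename(columns={"text": "dog_description"})
    # IDs are labels, not quantities, so they are stored as strings.
    out["tweet_id"] = out["tweet_id"].astype(str)
    return out.reset_index(drop=True)


def is_suspicious_name(names: pd.Series) -> pd.Series:
    """Flag 'names' that start with a lowercase letter (likely ordinary words)."""
    return names.str.match(r"[a-z]").fillna(False).astype(bool)
```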

In conclusion, **2 tidiness and 5 cleaning issues were handled in the twitter-archive-enhanced.csv data.**

### Programmatically downloading and cleaning the data file called image_predictions.tsv
Accessing this file requires an HTTP request, since it is stored on Udacity's servers. A piece of code that uses the os and requests libraries downloads the file only if it does not already exist in the working directory. After that, loading the file is the same process as before. This file is tidier and cleaner, but some issues are noted below. All the cleaning is performed on a copy of the original dataframe called pred_clean_df:<br><br>
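
A minimal sketch of the "download only if missing" logic described above, assuming the file's address is passed in as `url` (the real address is not given in this report, so none is hard-coded here). The function name is hypothetical.

```python
import os

import requests


def download_if_missing(url: str, path: str) -> bool:
    """Download `url` to `path` unless `path` already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(path):
        return False
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(path, "wb") as f:
        f.write(response.content)
    return True
```

After the call, the file can be loaded with `pd.read_csv(path, sep="\t")`, since it is tab-separated.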

**cleanliness issues:**
* The algorithm provides 3 guesses, a confidence value for each, and a boolean indicating whether each guess corresponds to a dog or not. We use this information to keep only valid dog breeds with a confidence larger than 0.5. Strictly speaking this is a tidiness issue, because the "dog" boolean columns aren't even required once proper filtering is applied to the data, and the breeds and confidence values can be stored in one column each. So the 9 columns originally present in the data can be replaced by just two
* Several column names could be more descriptive. Specifically, "img_num" is changed to "pred_sample_size", and the best-guess columns produced by the first fix are named "breed_pred" and "breed_conf"
* Again, the "tweet_id" column should be of object type. We have handled this
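
One possible way to implement the filtering above is sketched below. The column names `p1`, `p1_conf`, `p1_dog` (and likewise for guesses 2 and 3) are assumptions about the file layout, and dropping rows without a valid dog guess is one reading of "only keep valid dog breeds"; the notebook may differ on both points.

```python
import numpy as np
import pandas as pd


def best_dog_prediction(df: pd.DataFrame, min_conf: float = 0.5) -> pd.DataFrame:
    """Collapse the 9 prediction columns into 'breed_pred' and 'breed_conf'."""
    out = df.copy()
    out["breed_pred"] = pd.Series(pd.NA, index=out.index, dtype="object")
    out["breed_conf"] = np.nan
    # Apply guesses in reverse order so the highest-ranked valid guess wins.
    for i in (3, 2, 1):
        ok = out[f"p{i}_dog"] & (out[f"p{i}_conf"] > min_conf)
        out.loc[ok, "breed_pred"] = out.loc[ok, f"p{i}"]
        out.loc[ok, "breed_conf"] = out.loc[ok, f"p{i}_conf"]
    guess_cols = [f"p{i}{suffix}" for i in (1, 2, 3) for suffix in ("", "_conf", "_dog")]
    out = out.drop(columns=guess_cols).dropna(subset=["breed_pred"])
    out = out.rename(columns={"img_num": "pred_sample_size"})
    out["tweet_id"] = out["tweet_id"].astype(str)
    return out.reset_index(drop=True)
```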

In conclusion, **3 cleaning issues were handled in the image_predictions.tsv data.**

### Gathering, saving and cleaning data from the twitter api
First, we provide our secret Twitter API keys and tokens via a local script that is not visible in the notebook. Then an API object is created with parameters that handle the request-rate limit. We start by sending a single request and printing the result to get an idea of the response object. Then a piece of code is created that performs the following operations:
1. Checks if the 'tweet_json.txt' file exists in the working directory and terminates if so
2. If it does not, it iteratively sends a request for every tweet id in the two previous datasets
3. If a tweet is found, it writes the response as a line in the "tweet_json.txt" file; if not, it saves the tweet id of the failed request in an array
4. At the end, a message reports that the file was created and how many tweets were successfully found.
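
The four steps above can be sketched as follows. To avoid guessing at the API client's exact calls and error types, the request itself is represented by a hypothetical `fetch_tweet` callable that takes a tweet id, returns the tweet as a dict, and raises an exception when the tweet cannot be found. In the notebook this role is played by the API object.

```python
import json
import os


def gather_tweets(tweet_ids, fetch_tweet, path="tweet_json.txt"):
    """Write one JSON document per line; return (n_found, failed_ids).

    Returns None without doing anything if `path` already exists.
    """
    if os.path.exists(path):  # step 1: nothing to do if the file is there
        return None
    failed_ids = []
    n_found = 0
    with open(path, "w", encoding="utf-8") as f:
        for tweet_id in tweet_ids:  # step 2: one request per tweet id
            try:
                tweet = fetch_tweet(tweet_id)
            except Exception:  # the real client raises its own error type
                failed_ids.append(tweet_id)  # step 3: remember failures
                continue
            f.write(json.dumps(tweet) + "\n")  # step 3: one tweet per line
            n_found += 1
    # step 4: summary message
    print(f"Created {path}: {n_found} tweets found, {len(failed_ids)} failed.")
    return n_found, failed_ids
```

Passing the request function in as an argument also makes the loop easy to try out with a fake function before spending real API quota.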

Reading the 'tweet_json.txt' file is not as straightforward as in the previous cases because it contains a large amount of information, which is why we target only the parameters that interest us ('tweet_id', 'retweet_count', 'favorite_count'). We loop over the file line by line, pull those parameters, and store them in a dictionary that is later converted to a dataframe.
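
A minimal sketch of this line-by-line read, assuming each line holds one tweet as JSON with the id stored under the key `"id"` (as in Twitter's status objects). It collects a list of small dicts rather than a single dictionary, which is a simplification of the approach described above.

```python
import json

import pandas as pd


def read_tweet_json(path="tweet_json.txt") -> pd.DataFrame:
    """Pull only the id, retweet count and favorite count from each line."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():  # skip blank lines
                continue
            tweet = json.loads(line)
            rows.append(
                {
                    "tweet_id": str(tweet["id"]),  # keep ids as strings
                    "retweet_count": tweet["retweet_count"],
                    "favorite_count": tweet["favorite_count"],
                }
            )
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])
```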

Since we pull only specific parameters and control the whole process, the data are already tidy and clean. As before, we fix the "tweet_id" column being integer instead of object.

In conclusion, creating and reading this file is the hardest part, but there are few cleaning issues. **Just 1 cleaning issue is handled in the 'tweet_json.txt' data**

### Merging the three data sources, cleaning the issues that arise and saving the result
Our next goal is to merge the three dataframes, namely: **arch_clean_df, pred_clean_df, tweet_api_df**. All three of them have the "tweet_id" column, so it can be used as the key, but we need to keep in mind that they probably have different numbers of rows. If we use the default inner join, all rows that don't match will be lost, so instead we merge the first two with a left join, and the result is merged with the third dataframe with another left join. The final dataframe is called 
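
The two chained left joins can be sketched like this. The function name and the `merged` variable are placeholders for the example.

```python
import pandas as pd


def merge_sources(
    arch_clean_df: pd.DataFrame,
    pred_clean_df: pd.DataFrame,
    tweet_api_df: pd.DataFrame,
) -> pd.DataFrame:
    """Left-join predictions and API counts onto the archive by 'tweet_id'."""
    # how="left" keeps every archive row, even without a match on the right.
    merged = arch_clean_df.merge(pred_clean_df, on="tweet_id", how="left")
    return merged.merge(tweet_api_df, on="tweet_id", how="left")
```

Every archive row survives; where a tweet has no prediction or no API data, the corresponding columns are filled with NaN.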

One issue that arises is that some integer columns ("pred_sample_size", "retweet_count", "fav_count") are converted to float. Pandas does this automatically when missing values appear in an integer column, as happens for unmatched rows after a left join, because NaN is a float and the default integer dtype cannot represent missing values. This is fixed with some information provided <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#support-for-integer-na">here</a>
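
One possible fix, sketched here as an assumption rather than the notebook's exact approach, is to cast the affected columns to pandas' nullable integer dtype `"Int64"`, which can hold integers and missing values side by side. The function name is hypothetical.

```python
import numpy as np
import pandas as pd


def restore_integer_columns(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Cast float columns holding whole numbers and NaN to nullable Int64."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("Int64")  # NaN becomes <NA>
    return out
```

Note the capital "I": `"Int64"` is the nullable extension dtype, while `"int64"` is the plain NumPy dtype that raises an error on missing values.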

In conclusion, **a tidiness issue is handled by merging the 3 datasets and 1 cleaning issue is handled in the merged dataset**

After that, our data are in the form we want, so we store this final dataframe in a file called **'twitter_archive_master.csv'**