# Data Preprocessing - Formatting
This tutorial explains how to preprocess data using the pandas library. Preprocessing is a preliminary analysis of the data that transforms them into a standard, normalized format. It involves the following aspects:

* missing values
* data formatting 
* data normalization
* data standardization
* data binning

In this tutorial we use a Twitter dataset, which can be downloaded from [this link](https://www.trackmyhashtag.com/twitter-dataset).

This tutorial deals only with data formatting, the process of transforming data into a common format so that values can be compared. An example of unformatted data is a column that refers to the same entity with different values, such as New York and NY.

## Import data
First, import the data with the pandas library and load them into a dataframe. The `head(10)` method shows only the first 10 rows of the dataset.

In [64]:
import pandas as pd
df = pd.read_csv('tweets.csv')
df.head(10)

Unnamed: 0,Tweet Id,Tweet URL,Tweet Posted Time (UTC),Tweet Content,Tweet Type,Client,Retweets Received,Likes Received,Tweet Location,Tweet Language,...,Name,Username,User Bio,Verified or Non-Verified,Profile URL,Protected or Non-protected,User Followers,User Following,User Account Creation Date,Impressions
0,"""1167429261210218497""",https://twitter.com/animalhealthEU/status/1167...,30 Aug 2019 13:30:00,Pets change our lives &amp; become a part of o...,Tweet,Twitter Ads Composer,0,4,Brussels,English,...,AnimalhealthEurope,animalhealthEU,AnimalhealthEurope represents manufacturers of...,Non-Verified,https://twitter.com/animalhealthEU,Non-Protected,3697,542,17 Dec 2012 09:14:15,7394
1,"""1167375334670557185""",https://twitter.com/PennyBrohnUK/status/116737...,30 Aug 2019 09:55:43,Another spot of our #morethanmedicine bus in #...,Tweet,Twitter Web App,0,5,"Pill, Bristol",English,...,Penny Brohn UK,PennyBrohnUK,We help people live well with the impact of ca...,Non-Verified,https://twitter.com/PennyBrohnUK,Non-Protected,3227,1571,15 Sep 2010 09:44:02,6454
2,"""1167237977615097861""",https://twitter.com/lordbyronaf/status/1167237...,30 Aug 2019 00:49:54,What a great team ⁦@HealthSourceOH⁩ ⁦@Local12⁩...,ReTweet,Twitter for Android,0,0,"Ohio, USA",English,...,Lord ByronAF,lordbyronaf,"It's easier to be who you are, than it is to b...",Non-Verified,https://twitter.com/lordbyronaf,Non-Protected,7808,8617,25 Jul 2012 15:43:47,0
3,"""1167236897078480898""",https://twitter.com/CountessDavis/status/11672...,30 Aug 2019 00:45:37,What a great team ⁦@HealthSourceOH⁩ ⁦@Local12⁩...,ReTweet,Twitter for Android,0,0,,English,...,Lisa Countess davis,CountessDavis,I am named after @ElvisPresley daughter Lisa M...,Non-Verified,https://twitter.com/CountessDavis,Non-Protected,291,81,26 Jan 2017 18:21:42,0
4,"""1167228378191204353""",https://twitter.com/Local12/status/11672283781...,30 Aug 2019 00:11:46,What a great team ⁦@HealthSourceOH⁩ ⁦@Local12⁩...,ReTweet,TweetDeck,0,0,"Cincinnati, OH",English,...,Local 12/WKRC-TV,Local12,Local 12 is #Cincinnati's trusted source for b...,Verified,https://twitter.com/Local12,Non-Protected,198675,651,02 Sep 2008 20:09:44,0
5,"""1167228285463531520""",https://twitter.com/lbonis1/status/11672282854...,30 Aug 2019 00:11:23,What a great team ⁦@HealthSourceOH⁩ ⁦@Local12⁩...,Tweet,Twitter for iPhone,3,17,WKRC TV,English,...,Liz Bonis,lbonis1,Health and Medical Reporter/News Anchor Regist...,Verified,https://twitter.com/lbonis1,Non-Protected,6015,4866,16 Mar 2013 12:05:02,12033
6,"""1167163662051631104""",https://twitter.com/luapppank/status/116716366...,29 Aug 2019 19:54:36,Will you be at #FIX19? Want a preview of @AG_E...,ReTweet,Twitter for iPhone,0,0,"Scottsdale, AZ",English,...,paul knapp,luapppank,16.2 (and rising) hdcp golfer. fairways and gr...,Non-Verified,https://twitter.com/luapppank,Non-Protected,69,81,17 May 2009 12:58:43,0
7,"""1167109799282118656""",https://twitter.com/AG_EM33/status/11671097992...,29 Aug 2019 16:20:34,Will you be at #FIX19? Want a preview of @AG_E...,ReTweet,Twitter for iPhone,0,0,,English,...,Alin,AG_EM33,DO &amp; MPH | EM resident | Interests: #FOAMe...,Non-Verified,https://twitter.com/AG_EM33,Non-Protected,9655,3043,26 Mar 2016 03:57:53,0
8,"""1167071808895541248""",https://twitter.com/andyglittle/status/1167071...,29 Aug 2019 13:49:37,Will you be at #FIX19? Want a preview of @AG_E...,ReTweet,Twitter for iPhone,0,0,"Columbus,OH",English,...,Andy Little,andyglittle,"EM DOC @DoctorsEMres, PodCaster, lover of my f...",Non-Verified,https://twitter.com/andyglittle,Non-Protected,1697,1123,25 Nov 2013 19:42:29,0
9,"""1167069888474537984""",https://twitter.com/MOX13/status/1167069888474...,29 Aug 2019 13:41:59,Will you be at #FIX19? Want a preview of @AG_E...,ReTweet,Twitter for iPhone,0,0,,English,...,Tanner Gronowski,MOX13,"Emergency Medicine Doc, outdoor enthusiast, am...",Non-Verified,https://twitter.com/MOX13,Non-Protected,796,613,02 Jun 2009 17:50:10,0


In this tutorial, we drop all the rows that contain missing values through the `dropna()` method.

In [68]:
df.dropna(inplace=True)
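
Before dropping rows, it can help to see how many values are missing in each column. The following is a minimal sketch on a hypothetical toy dataframe (the name `toy` and its values are illustrative, not part of the dataset); it also shows that `dropna()` without `inplace=True` returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data standing in for the tweets dataset
toy = pd.DataFrame({
    'Tweet Location': ['Brussels', None, 'Columbus,OH'],
    'Likes Received': [4, 0, 5],
})

# Count the missing values in each column before dropping anything
missing_per_column = toy.isna().sum()

# Without inplace=True, dropna() returns a new dataframe and keeps `toy` as it is
cleaned = toy.dropna()
```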

## Incorrect data types
First of all, we should make sure that every column has the correct data type. This can be checked through the `dtypes` property.

In [65]:
df.dtypes

Tweet Id                      object
Tweet URL                     object
Tweet Posted Time (UTC)       object
Tweet Content                 object
Tweet Type                    object
Client                        object
Retweets Received              int64
Likes Received                 int64
Tweet Location                object
Tweet Language                object
User Id                       object
Name                          object
Username                      object
User Bio                      object
Verified or Non-Verified      object
Profile URL                   object
Protected or Non-protected    object
User Followers                 int64
User Following                 int64
User Account Creation Date    object
Impressions                    int64
dtype: object

In our case we can convert the column `Tweet Location` to `string` by using the function `astype()` as follows:

In [66]:
df['Tweet Location'] = df['Tweet Location'].astype('string')

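The `astype()` function accepts NumPy data types as well as pandas extension types such as `'string'` and `'category'`. Many of the common type names are also described at [this link](https://www.pytables.org/usersguide/datatypes.html).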
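
The `dtypes` listing also shows the two date columns as `object`. One possible way to give them a proper type is `pd.to_datetime()`. The sketch below uses two sample values copied from the `Tweet Posted Time (UTC)` column; the format string is an assumption inferred from those samples, not something the dataset documents.

```python
import pandas as pd

# Sample values copied from the `Tweet Posted Time (UTC)` column shown earlier
posted = pd.Series(['30 Aug 2019 13:30:00', '29 Aug 2019 19:54:36'])

# Assumed format: day, abbreviated month name, year, then hours:minutes:seconds
posted_dt = pd.to_datetime(posted, format='%d %b %Y %H:%M:%S')

# Once the column is a datetime, date parts are available through the .dt accessor
years = posted_dt.dt.year
```

On the full dataframe the same call would be applied to `df['Tweet Posted Time (UTC)']` and assigned back to that column.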

## Make the data homogeneous
This aspect involves categorical and numeric data. Categorical data should all follow the same formatting style, such as lower case. Numeric data should, for example, have the same number of digits after the decimal point. To convert a text column such as `Tweet Content` to lower case, we can use the following statement:

In [95]:
df['Tweet Content'] = df['Tweet Content'].str.lower()
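
For numeric data, a similar effect can be obtained with `round()`. The numeric columns in the `dtypes` listing above are all integers, so the sketch below uses a hypothetical float series (`ratios`) purely for illustration.

```python
import pandas as pd

# Hypothetical float values; not taken from the tweets dataset
ratios = pd.Series([0.5, 1.23456, 2.0])

# round() keeps the values numeric and limits them to two decimal places
rounded = ratios.round(2)

# Formatting as text pads every value to exactly two digits after the point
as_text = ratios.map('{:.2f}'.format)
```

`round()` is the better choice when the values are used in further calculations, while the text version is only suitable for display.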

## Different values for the same concept
The same concept may be represented in different ways. For example, in our dataset, the column `Tweet Location` contains the values `Columbus,OH` and `Columbus, OH` to describe the same place. We can use the `unique()` function to list all the distinct values of a column.

In [67]:
df['Tweet Location'].unique()

<StringArray>
[                                 'Brussels',
                             'Pill, Bristol',
                                 'Ohio, USA',
                                        <NA>,
                            'Cincinnati, OH',
                                   'WKRC TV',
                            'Scottsdale, AZ',
                               'Columbus,OH',
                              'Columbus, OH',
                             'DK Diner, USA',
 ...
                           'Kampala, Uganda',
                        'ilorin,kwara state',
                            'Nigeria, Lagos',
                                    'Kigali',
                        'Towcester, England',
 'Heart of the EU (the clue is in the name)',
                       'South West, England',
                                'Manchester',
                               'Seattle, WA',
                         'in my happy place']
Length: 106, dtype: string
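
With 106 distinct values, near-duplicates are hard to spot by eye. As a rough sketch, we can build a comparison key for each value and look for keys that are shared by more than one spelling. The key used here (lower case, whitespace removed) is an assumption chosen for this example; other kinds of variation would need a different key.

```python
import pandas as pd

# A few values taken from the unique() output above
locations = pd.Series(['Columbus,OH', 'Columbus, OH', 'Cincinnati, OH', 'Kigali'])

# Hypothetical comparison key: lower case with all whitespace removed
key = locations.str.lower().str.replace(r'\s+', '', regex=True)

# Count the distinct spellings behind each key
variant_counts = locations.groupby(key).nunique()

# Keep only the values whose key is shared by more than one spelling
suspect_keys = variant_counts[variant_counts > 1].index
suspects = locations[key.isin(suspect_keys)]
```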

To deal with different values that represent the same concept, we should handle each type of error separately. For example, we can rewrite every string of the form `word,word` so that a space follows the comma, giving `word, word`. We define a function called `set_pattern()` that searches a string for a specific pattern and, if the pattern is found, performs a replacement in that string. In our case we search for strings with the structure `word,word`, replace `,` with `, `, and return the result.

In [89]:
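import re

def set_pattern(x):
    # Capitalized word, comma, capitalized word, e.g. 'Columbus,OH'
    pattern = r'[A-Z]\w+,[A-Z]\w+'
    # Missing values are not strings, so they are returned unchanged
    if isinstance(x, str) and re.match(pattern, x):
        # Insert a space after the comma
        x = x.replace(',', ', ')
    return x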
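
A quick check on a few values from the listing above shows what the pattern does and does not catch. The sketch restates the function in a self-contained form, including `import re` and a guard for non-string values; the list `samples` is only for illustration.

```python
import re

def set_pattern(x):
    # Capitalized word, comma, capitalized word, e.g. 'Columbus,OH'
    pattern = r'[A-Z]\w+,[A-Z]\w+'
    # Guard against missing values, which are not strings
    if isinstance(x, str) and re.match(pattern, x):
        x = x.replace(',', ', ')
    return x

# Values taken from the unique() output above
samples = ['Columbus,OH', 'Columbus, OH', 'ilorin,kwara state', 'Kigali']
results = [set_pattern(s) for s in samples]
```

The lower-case value `ilorin,kwara state` is left unchanged, because the pattern requires each word to start with a capital letter.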

Now we can apply the function to every value in the column `Tweet Location`. This can be achieved by combining the `apply()` function with a `lambda` function. The parameter `axis=1` tells `apply()` to process the dataframe row by row; the `lambda` function then selects the `Tweet Location` value of each row and passes it to `set_pattern()`.

In [92]:
df['Tweet Location'] = df.apply(lambda x: set_pattern(x['Tweet Location']), axis=1)
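
Since only one column is involved, the same result can also be obtained without `axis=1`. The sketch below, on a hypothetical toy dataframe, shows two alternatives: calling `apply()` directly on the column, and a vectorized `str.replace()` with a regular expression. The second pattern is an assumption for this example and is deliberately broader than `set_pattern()`: it normalizes every comma, whatever the capitalization.

```python
import re
import pandas as pd

def set_pattern(x):
    # Restated from above so that the example runs on its own
    if isinstance(x, str) and re.match(r'[A-Z]\w+,[A-Z]\w+', x):
        x = x.replace(',', ', ')
    return x

# Hypothetical toy column with values taken from the unique() output
toy = pd.DataFrame({'Tweet Location': ['Columbus,OH', 'Columbus, OH', 'ilorin,kwara state']})

# Series.apply passes each value of the column directly to the function
by_apply = toy['Tweet Location'].apply(set_pattern)

# Vectorized alternative: rewrite every comma plus optional whitespace as ', '
by_replace = toy['Tweet Location'].str.replace(r',\s*', ', ', regex=True)
```

The two results differ only for `ilorin,kwara state`, which the vectorized version also normalizes.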