# Module 5: Data Cleaning and String Methods

Data sets downloaded online are often not in the right format for you to start your analysis. 
More often than not, you'll spend the bulk of your time cleaning and manipulating the data that you acquired. 
In this module, we will show you how to deal with missing values, how to convert columns to different data types, 
and introduce a new data type for representing dates: `datetime`.

Let's first take a look at the landslides dataset from the Kaggle data cleaning challenge. 
This dataset records the Global Landslide Catalog (GLC) that was developed with the goal of identifying rainfall-triggered landslide 
events around the world, regardless of size, impacts, or location. 
The GLC considers all types of mass movements triggered by rainfall that have been reported in the media, disaster databases, scientific reports, or other sources.

Dataset downloaded from: [https://www.kaggle.com/rtatman/data-cleaning-challenge-parsing-dates](https://www.kaggle.com/rtatman/data-cleaning-challenge-parsing-dates)
Optional: if you want to download this dataset from this source, scroll down to the `input(3)` section, under `Data Sources` --> `Landslides After Rainfall, 2007-2016` --> `catalog.csv`.

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np

# Show every column instead of truncating wide tables
pd.options.display.max_columns = None

landslides = pd.read_csv('landslides.csv')
landslides.head()

Unnamed: 0,id,date,time,continent_code,country_name,country_code,state/province,population,city/town,distance,...,geolocation,hazard_type,landslide_type,landslide_size,trigger,storm_name,injuries,fatalities,source_name,source_link
0,34,3/2/07,Night,,United States,US,Virginia,16000,Cherry Hill,3.40765,...,"(38.600900000000003, -77.268199999999993)",Landslide,Landslide,Small,Rain,,,,NBC 4 news,http://www.nbc4.com/news/11186871/detail.html
1,42,3/22/07,,,United States,US,Ohio,17288,New Philadelphia,3.33522,...,"(40.517499999999998, -81.430499999999995)",Landslide,Landslide,Small,Rain,,,,Canton Rep.com,http://www.cantonrep.com/index.php?ID=345054&C...
2,56,4/6/07,,,United States,US,Pennsylvania,15930,Wilkinsburg,2.91977,...,"(40.4377, -79.915999999999997)",Landslide,Landslide,Small,Rain,,,,The Pittsburgh Channel.com,https://web.archive.org/web/20080423132842/htt...
3,59,4/14/07,,,Canada,CA,Quebec,42786,Châteauguay,2.98682,...,"(45.322600000000001, -73.777100000000004)",Landslide,Riverbank collapse,Small,Rain,,,,Le Soleil,http://www.hebdos.net/lsc/edition162007/articl...
4,61,4/15/07,,,United States,US,Kentucky,6903,Pikeville,5.66542,...,"(37.432499999999997, -82.493099999999998)",Landslide,Landslide,Small,Downpour,,,0.0,Matthew Crawford (KGS),


## Dropping missing values and NaN imputations

Before doing any analysis on the data, it's always good to check your data for missing values. 
If data is missing from the CSV you read in, the missing values will usually appear as `NaN`, which is the default way pandas encodes missing values. 
Sometimes missing values are caused by unintentional manual or programmatic errors, but at other times the data is genuinely missing.
Before you deal with the missing values, read the description of the data to understand where they might come from, and look at how they are distributed, since the pattern of missing data can itself be informative.

Whether to keep the `NaN` values is ultimately a subjective decision. If the missing values you are seeing are due to manual input errors, then imputing the data might be a good option for you.
However, if a programmatic error caused the missing values, it may be more beneficial to find that error and correct it so the values can be read in correctly. 
For values that are genuinely missing, the same thought process applies: what is the pattern of the missing data? What is the reason for the missing data?
It is often much better to find the underlying reason for the missing data and fill it in accordingly. 

It can be tempting to fill in all missing values with `0`, but this is often not the best course of action. 
For example, if your data records home sale prices, then replacing the missing sale prices with `0` implies that those properties actually sold for $0. 


### Type of missing values

There are many types of missing values, the most common of which is `NaN`. 
Some datasets, however, encode missing data with a placeholder value such as 1, -9999, or infinity. 
Sometimes the `NaN` values have already been dealt with in the data, and missing strings are encoded as "N/A" or "missing".
If the missing data is a time, it might be presented as 00:00:00 UTC, January 1, 1970. 
This is because Unix-like systems, including Linux, count time in seconds since the "epoch", which is 00:00:00 UTC on Jan. 1, 1970.
This date is also referred to as the default Unix time.
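
Placeholder values like these can be turned into `NaN` while the file is being read. The snippet below is a minimal sketch on a made-up CSV (the column names and the placeholders `-9999` and `missing` are assumptions for illustration, not part of the landslides data); it uses the `na_values` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text in which -9999 and "missing" stand in for missing data
raw = io.StringIO(
    "station,rainfall_mm,observer\n"
    "A,12.5,Kim\n"
    "B,-9999,missing\n"
    "C,3.1,Lee\n"
)

# na_values lists extra tokens that read_csv should treat as missing
sentinel_df = pd.read_csv(raw, na_values=["-9999", "missing"])

# Count the missing values per column after the conversion
missing_per_column = sentinel_df.isnull().sum()
print(missing_per_column)
```

Once the placeholders are real `NaN` values, the tools shown in the rest of this section apply to them as usual.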

In the landslide dataset we loaded above, the missing values are of type `NaN`.

Before dealing with the missing values, let's look at how many rows and how many columns our data has. 
This will give us a rough idea of how much of the data is missing from a specific column later on. 

In [None]:
landslides.shape

(1693, 23)

In pandas, `df.isnull()` returns a data frame of `True` and `False` values, where `True` signifies that there is a missing value at
that specific row and column.

In [None]:
landslides.isnull().head()

Unnamed: 0,id,date,time,continent_code,country_name,country_code,state/province,population,city/town,distance,...,geolocation,hazard_type,landslide_type,landslide_size,trigger,storm_name,injuries,fatalities,source_name,source_link
0,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,True,True,True,False,False
1,False,False,True,True,False,False,False,False,False,False,...,False,False,False,False,False,True,True,True,False,False
2,False,False,True,True,False,False,False,False,False,False,...,False,False,False,False,False,True,True,True,False,False
3,False,False,True,True,False,False,False,False,False,False,...,False,False,False,False,False,True,True,True,False,False
4,False,False,True,True,False,False,False,False,False,False,...,False,False,False,False,False,True,True,False,False,True


If you want to see a summary of all columns with their corresponding number of missing values, use `df.isnull().sum()`.
This is because `True` values are internally stored as ones, and the `False` values are stored as zeros.

In [None]:
landslides.isnull().sum()

id                         0
date                       3
time                    1064
continent_code          1529
country_name               0
country_code               0
state/province             1
population                 0
city/town                  4
distance                   1
location_description    1142
latitude                   1
longitude                  1
geolocation                1
hazard_type                0
landslide_type             1
landslide_size             1
trigger                    2
storm_name              1561
injuries                1178
fatalities               247
source_name              821
source_link              100
dtype: int64

Interestingly enough, there are a lot of columns with fewer than three missing values. 
Are the same rows causing these missing values in all of these columns? Let's find out.

In [None]:
landslides[landslides['latitude'].isnull()]

Unnamed: 0,id,date,time,continent_code,country_name,country_code,state/province,population,city/town,distance,...,geolocation,hazard_type,landslide_type,landslide_size,trigger,storm_name,injuries,fatalities,source_name,source_link
596,3177,3/7/11,2:00:00,,United States,US,Connecticut,28142,New Milford,,...,,Landslide,Mudslide,Medium,Downpour,,,0.0,,http://www.newstimes.com/local/article/New-Mil...


In [None]:
landslides[landslides['landslide_type'].isnull()]

Unnamed: 0,id,date,time,continent_code,country_name,country_code,state/province,population,city/town,distance,...,geolocation,hazard_type,landslide_type,landslide_size,trigger,storm_name,injuries,fatalities,source_name,source_link
1511,7137,4/17/15,17:00,,Canada,CA,Nova Scotia,2052,Digby,11.62624,...,"(44.564399999999999, -65.636200000000002)",Landslide,,,Unknown,,0.0,0.0,The Digby County Courier,http://www.digbycourier.ca/News/Local/2015-04-...


As it turns out, the two rows above are responsible for the only `NaN` values in 6 columns. 
Let's delete these rows using their index. 

Be aware of the impact that deleting these rows might have on your own analysis, and depending on your research, 
deleting observations with some missing data might not be advisable. 
For our purposes here, deleting the two observations will not have a negative impact. 

In [None]:
landslides = landslides.drop([596, 1511])
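
Dropping by index label works when you already know which rows to remove. As an alternative sketch, `dropna` with the `subset` parameter removes every row that is missing a value in the listed columns. The toy data frame below is made up for illustration; only the column names and index labels echo the landslides table.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with non-contiguous index labels kept on purpose
toy = pd.DataFrame(
    {
        "latitude": [38.6, np.nan, 40.4],
        "landslide_type": ["Landslide", "Mudslide", None],
    },
    index=[0, 596, 1511],
)

# Drop every row that is missing a value in any of the listed columns
cleaned = toy.dropna(subset=["latitude", "landslide_type"])
print(cleaned)
```

Unlike `drop`, this approach does not require looking up the index labels first, but it will also remove any other rows with gaps in those columns, so check how many rows disappear.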


Now let's look at the summary of all columns with their corresponding number of missing values again. 

In [None]:
landslides.isnull().sum()

id                         0
date                       3
time                    1064
continent_code          1527
country_name               0
country_code               0
state/province             1
population                 0
city/town                  4
distance                   0
location_description    1141
latitude                   0
longitude                  0
geolocation                0
hazard_type                0
landslide_type             0
landslide_size             0
trigger                    2
storm_name              1559
injuries                1177
fatalities               247
source_name              820
source_link              100
dtype: int64

From the summary above, we can see that some of the columns have a very high proportion of missing values: 
the original dataset has only 1693 rows, and column `continent_code` has 1527 values missing!

Apart from the column `continent_code`, some of the columns with high proportions of missing values are 
`time`, `location_description`, `storm_name`, `injuries`, `fatalities`, `source_name`, and `source_link`.
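
Proportions are often easier to compare than raw counts. The sketch below, on a small made-up data frame, relies on the fact that the mean of a column of `True`/`False` values is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names mirror the landslides table
toy = pd.DataFrame({
    "time": ["Night", None, None, None],
    "population": [16000, 17288, 15930, 42786],
    "injuries": [np.nan, np.nan, 0.0, 1.0],
})

# Fraction of missing values per column, largest first
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)
```

Applied to the landslides table, the same expression would be `landslides.isnull().mean()`.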

For analysis purposes, if some of these columns aren't very useful they could be dropped. 
For our purposes here, we will keep all of the columns, 
but we'll be replacing the `NaN` values on some of the non-numerical columns with "Missing". 
This is normally done to reduce the number of bugs caused by missing values in the future. 

Before we start replacing the missing values, let's first see the data types of each of our columns using `df.dtypes`

In [None]:
landslides.dtypes

id                        int64
date                     object
time                     object
continent_code           object
country_name             object
country_code             object
state/province           object
population                int64
city/town                object
distance                float64
location_description     object
latitude                float64
longitude               float64
geolocation              object
hazard_type              object
landslide_type           object
landslide_size           object
trigger                  object
storm_name               object
injuries                float64
fatalities              float64
source_name              object
source_link              object
dtype: object

To replace `NaN` values in a column with a string, assign the result of `df['col_name'].fillna('string')` back to the column. 
In recent pandas versions, calling `fillna(..., inplace=True)` on a selected column triggers a warning, and with copy-on-write enabled it no longer updates the data frame. 
Here, we will be replacing the `NaN` values in columns 
`time`, `continent_code`, `state/province`, `city/town`, `location_description`, `trigger`, `storm_name`, `source_name`, and `source_link`.

Here we are filling the missing values for each individual column separately.

In [None]:
landslides['time'] = landslides['time'].fillna("Missing")
landslides['continent_code'] = landslides['continent_code'].fillna("Missing")
landslides['state/province'] = landslides['state/province'].fillna("Missing")
landslides['city/town'] = landslides['city/town'].fillna("Missing")
landslides['location_description'] = landslides['location_description'].fillna("Missing")
landslides['trigger'] = landslides['trigger'].fillna("Missing")
landslides['storm_name'] = landslides['storm_name'].fillna("Missing")
landslides['source_name'] = landslides['source_name'].fillna("Missing")
landslides['source_link'] = landslides['source_link'].fillna("Missing")
landslides.isnull().sum()

id                         0
date                       3
time                       0
continent_code             0
country_name               0
country_code               0
state/province             0
population                 0
city/town                  0
distance                   0
location_description       0
latitude                   0
longitude                  0
geolocation                0
hazard_type                0
landslide_type             0
landslide_size             0
trigger                    0
storm_name                 0
injuries                1177
fatalities               247
source_name                0
source_link                0
dtype: int64

Alternatively, we can pass a dictionary to `.fillna()` where the keys are the columns that we want to fill and the associated values are the 
strings (or other data) we want to replace the missing data with.

In [None]:
replace_dict = {'time': 'Missing', 
                'continent_code': 'Missing',
                'state/province': 'Missing',
                'city/town': 'Missing',
                'location_description': 'Missing',
                'trigger': 'Missing',
                'storm_name': 'Missing',
                'source_name': 'Missing',
                'source_link': 'Missing'}

landslides = landslides.fillna(replace_dict)

landslides.isnull().sum()

id                         0
date                       3
time                       0
continent_code             0
country_name               0
country_code               0
state/province             0
population                 0
city/town                  0
distance                   0
location_description       0
latitude                   0
longitude                  0
geolocation                0
hazard_type                0
landslide_type             0
landslide_size             0
trigger                    0
storm_name                 0
injuries                1177
fatalities               247
source_name                0
source_link                0
dtype: int64

For the missing values in `injuries` and `fatalities`, let's replace them with the median value of their respective columns. 
We chose the median instead of the mean here because the mean is easily skewed by extreme values, 
which are likely present here since larger, named landslides are likely to cause large numbers of injuries and fatalities.

With your own data, you need to make a judgement on how to impute the missing values: mean, median, or mode?

Here the `df.fillna()` function is useful again, except now we are filling the missing values with the median of each column instead of "Missing". 
We do not need to specify which columns to impute, since the only columns left with missing values are `date`, `injuries`, and `fatalities`. 
`df.fillna(df.median(numeric_only=True))` fills the `NaN` values in the numeric columns only, and since `injuries` and `fatalities` are the only numeric columns of the three, only these two are affected by the command below. 


In [None]:
landslides = landslides.fillna(landslides.median(numeric_only=True))
landslides.isnull().sum()

id                      0
date                    3
time                    0
continent_code          0
country_name            0
country_code            0
state/province          0
population              0
city/town               0
distance                0
location_description    0
latitude                0
longitude               0
geolocation             0
hazard_type             0
landslide_type          0
landslide_size          0
trigger                 0
storm_name              0
injuries                0
fatalities              0
source_name             0
source_link             0
dtype: int64
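
To see why the median can be the safer choice when a column contains extreme values, here is a small sketch on made-up fatality counts (the numbers are invented for illustration and are not taken from the landslides data).

```python
import numpy as np
import pandas as pd

# Made-up counts: mostly small, one extreme event, one missing value
fatalities = pd.Series([0.0, 0.0, 1.0, 2.0, 500.0, np.nan])

mean_value = fatalities.mean()      # pulled upward by the single extreme value
median_value = fatalities.median()  # the middle of the observed values

# Impute the missing entry with the median
filled_with_median = fatalities.fillna(median_value)
print(mean_value, median_value)
```

Here the mean is about 100 even though four of the five observed values are at most 2, while the median stays at 1.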

What about the missing date values? Since pandas has a special way of representing dates, 
we will discuss how pandas treats missing date values and datetime objects later on.
If you are interested in learning more about imputing missing data and how to handle missing values, 
a good starting point is [this blog](https://medium.com/@drnesr/filling-gaps-of-a-time-series-using-python-d4bfddd8c460).


## Type Conversion 



In a pandas dataframe (or series), each column consists of only one type of data. For example, `int64` for integers, `float64` for decimals, `object` for strings, `datetime64` for dates and `bool` for true/false values.

Most of these are standard, except `datetime64`. Commonly known as datetime, this is how pandas represents dates; more on this later in the module.

When loading data with `pd.read_csv`, we have the option to set the data types of the columns ourselves using the optional parameter `dtype`. If we do not use it, pandas will infer the types itself. Let's have a look at the data types that pandas has assigned to the landslides table.

In [None]:
landslides.dtypes

id                        int64
date                     object
time                     object
continent_code           object
country_name             object
country_code             object
state/province           object
population                int64
city/town                object
distance                float64
location_description     object
latitude                float64
longitude               float64
geolocation              object
hazard_type              object
landslide_type           object
landslide_size           object
trigger                  object
storm_name               object
injuries                float64
fatalities              float64
source_name              object
source_link              object
dtype: object

The column `country_name` is of type `object`. Let's have a look at the actual data type of each element.

In [None]:
landslides.country_name

0       United States
1       United States
2       United States
3              Canada
4       United States
            ...      
1688    United States
1689    United States
1690    United States
1691    United States
1692    United States
Name: country_name, Length: 1693, dtype: object

In [None]:
type(landslides.country_name[0])

str

As you can see, the `country_name` column actually contains strings but the series is recognised as an object in pandas. Regardless, we can perform string functions on it.

Next, let's have a look at the `latitude` column.

In [None]:
landslides.latitude

0       38.6009
1       40.5175
2       40.4377
3       45.3226
4       37.4325
         ...   
1688    35.2219
1689    38.3987
1690    37.4096
1691    37.5011
1692    43.4771
Name: latitude, Length: 1693, dtype: float64

The latitude is of type `float64` as it contains decimal numbers.

What about the default dtype of the `id` column? What did pandas read it in as?

In [None]:
landslides.id

0         34
1         42
2         56
3         59
4         61
        ... 
1688    7535
1689    7537
1690    7539
1691    7540
1692    7541
Name: id, Length: 1693, dtype: int64

It's been read in as `int64` because it contains integer values, which makes sense. 
But let us think about what `id` represents in our table. 
While it does contain numbers, its use is to uniquely identify each row. 
For example, the `id`s of the first two rows are 34 and 42, but it doesn't make sense to perform arithmetic operations on these two `id`s 
(i.e., you shouldn't try to add them).

We can convert the data type of a column by using `series.astype(type)`. 
This function converts the data type of an entire series. 
Let's convert the data type of column `id` from `int64` to a string. 
Note that if we use `str` as the type in the code, the resulting dtype will be shown as `object`. 

In [None]:
landslides.id.astype('str')

0         34
1         42
2         56
3         59
4         61
        ... 
1688    7535
1689    7537
1690    7539
1691    7540
1692    7541
Name: id, Length: 1693, dtype: object

Note that we are displaying a converted version of the `id` column above, but have not changed the original column. 
To do that, we have to assign the converted version back to the original dataframe like this.

In [None]:
landslides['id'] = landslides.id.astype('str')
landslides['id']

0         34
1         42
2         56
3         59
4         61
        ... 
1688    7535
1689    7537
1690    7539
1691    7540
1692    7541
Name: id, Length: 1691, dtype: object
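
The same conversion can also be requested at load time through the `dtype` parameter of `pd.read_csv` mentioned earlier. The following is a minimal sketch on a made-up two-row extract; only the column names mirror the landslides table.

```python
import io
import pandas as pd

# Hypothetical two-row extract standing in for the real CSV file
raw = io.StringIO("id,population\n34,16000\n42,17288\n")

# dtype maps column names to the types pandas should use while parsing
typed = pd.read_csv(raw, dtype={"id": str})

print(typed.dtypes)
print(type(typed["id"].iloc[0]))
```

Columns that are not listed in the dictionary, such as `population` here, are still inferred by pandas.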

## DateTime Objects 

`datetime` objects are the standard representation of dates in `Python` and `pandas`. 

We can check the type of the `date` column in the landslide dataset by accessing its `dtype` attribute.



In [None]:
landslides['date'].dtype

dtype('O')

Notice that the date column is stored as an `object` series (that is, as plain strings) rather than as a `datetime` series. 

We can convert this series into a datetime series by calling `pd.to_datetime`.

In [None]:
landslides.date

0        3/2/07
1       3/22/07
2        4/6/07
3       4/14/07
4       4/15/07
         ...   
1688    12/7/15
1689    2/22/16
1690    2/23/16
1691    2/26/16
1692     3/2/16
Name: date, Length: 1693, dtype: object

In [None]:
pd.to_datetime(landslides['date'])

0      2007-03-02
1      2007-03-22
2      2007-04-06
3      2007-04-14
4      2007-04-15
          ...    
1688   2015-12-07
1689   2016-02-22
1690   2016-02-23
1691   2016-02-26
1692   2016-03-02
Name: date, Length: 1693, dtype: datetime64[ns]

The datetime conversion displays our dates in the format `YYYY-MM-DD`. 
Additionally, `pd.to_datetime` does not modify the column in place, so we have to assign the returned datetime series back to the existing `date` column. 

We can also use the `dayfirst` and `yearfirst` arguments to specify the format of the original date column.

In [None]:
landslides['date'] = pd.to_datetime(landslides['date'], dayfirst=False, yearfirst=False)
landslides['date']

0      2007-03-02
1      2007-03-22
2      2007-04-06
3      2007-04-14
4      2007-04-15
          ...    
1688   2015-12-07
1689   2016-02-22
1690   2016-02-23
1691   2016-02-26
1692   2016-03-02
Name: date, Length: 1693, dtype: datetime64[ns]
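
When the layout of the date strings is known, it can also be spelled out explicitly instead of being inferred. The sketch below assumes the month/day/two-digit-year layout seen in the raw `date` column and adds one made-up malformed entry to show how `errors="coerce"` behaves.

```python
import pandas as pd

# Strings in month/day/two-digit-year style, plus one invented bad entry
raw_dates = pd.Series(["3/2/07", "3/22/07", "not a date"])

# format pins down the layout; errors="coerce" turns unparseable entries into NaT
parsed = pd.to_datetime(raw_dates, format="%m/%d/%y", errors="coerce")
print(parsed)
```

`NaT` ("not a time") is the datetime counterpart of `NaN`, so `isnull()` and `fillna()` work on it as well.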

### Accessing datetime attributes

With datetime objects, we can easily access the day, month, and year attributes of our dates.

In [None]:
one_date = landslides.loc[0, 'date']
one_date

Timestamp('2007-03-02 00:00:00')

In [None]:
print(one_date.year, one_date.month, one_date.day)

2007 3 2


We can also use the `.dt` accessor to get the year, month, and day of the entire datetime series:

In [None]:
landslides['date'].dt.year

0       2007.0
1       2007.0
2       2007.0
3       2007.0
4       2007.0
         ...  
1688    2015.0
1689    2016.0
1690    2016.0
1691    2016.0
1692    2016.0
Name: date, Length: 1693, dtype: float64

In [None]:
landslides['date'].dt.month

0        3.0
1        3.0
2        4.0
3        4.0
4        4.0
        ... 
1688    12.0
1689     2.0
1690     2.0
1691     2.0
1692     3.0
Name: date, Length: 1693, dtype: float64

In [None]:
landslides['date'].dt.day

0        2.0
1       22.0
2        6.0
3       14.0
4       15.0
        ... 
1688     7.0
1689    22.0
1690    23.0
1691    26.0
1692     2.0
Name: date, Length: 1693, dtype: float64

### Grouping and Aggregating

With datetime objects, we can easily group and aggregate data by a particular day, month, or year. 
For example, we can find the average distance for all landslides in a particular month with a simple `.groupby` call.

In [None]:
landslides.groupby(landslides['date'].dt.month)[['distance']].mean()

date,distance
1.0,8.02035
2.0,9.011861
3.0,6.030639
4.0,5.455916
5.0,7.599642
6.0,9.846973
7.0,8.879942
8.0,10.21335
9.0,8.620497
10.0,7.04763


**Challenge**: How would we add a new column to our dataframe corresponding to the name of the month in which the landslide occurred?

*Hint: Use `dt.month_name`*

In [None]:
landslides['month_name'] = landslides['date'].dt.month_name()
landslides['month_name']

0          March
1          March
2          April
3          April
4          April
          ...   
1688    December
1689    February
1690    February
1691    February
1692       March
Name: month_name, Length: 1691, dtype: object

For missing datetime values, if the rows are in sequential order, you can try to impute the dates based on neighboring rows. 
A common imputation method for such sequential dates is to replace the missing date with the midpoint of the dates in the neighboring rows. 
For some ways to impute missing datetime objects, please visit [this blog](https://medium.com/@drnesr/filling-gaps-of-a-time-series-using-python-d4bfddd8c460).
If, however, the dates are not sequential, you can also replace the missing dates with the start of Unix time (1970-01-01).

In [None]:
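# Make sure the column is datetime, then fill missing dates with the Unix epoch
landslides['date'] = pd.to_datetime(landslides['date'], dayfirst=False, yearfirst=False)
landslides['date'] = landslides['date'].fillna(pd.Timestamp("1970-01-01"))
landslides['date']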
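
The midpoint idea described above for sequential dates can be sketched as follows. This is only an illustration on a made-up, chronologically ordered series, not a step applied to the landslides table, and it is one of several possible ways to compute the midpoint.

```python
import pandas as pd

# Made-up, chronologically ordered dates with one gap
dates = pd.Series(pd.to_datetime(["2020-01-01", None, "2020-01-05"]))

previous_date = dates.ffill()  # last known date at or before each row
next_date = dates.bfill()      # first known date at or after each row

# Midpoint between the neighbours; it is only used where the original is missing
midpoint = previous_date + (next_date - previous_date) / 2
imputed = dates.fillna(midpoint)
print(imputed)
```

A gap at the very start or end of the series has only one neighbour, so it would stay missing with this approach and need separate handling.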

### Filtering

Finally, a datetime series allows us to quickly query rows within a certain date range with boolean indexing. 
Normally, numerical comparison between `string` objects doesn't make much sense, but datetime comparisons work because dates that occur later are considered larger than dates that occurred earlier.

To select the rows that fall between two dates, we simply have to find all the dates that are "greater" than the start date and "less" than the end date.

The expression below gets all the landslide entries between Jan 1, 2007 and Jan 1, 2010:

In [None]:
landslides[(landslides['date'] > '2007-01-01') & (landslides['date'] < '2010-01-01')]

Unnamed: 0,id,date,time,continent_code,country_name,country_code,state/province,population,city/town,distance,...,geolocation,hazard_type,landslide_type,landslide_size,trigger,storm_name,injuries,fatalities,source_name,source_link
0,34,2007-03-02,Night,,United States,US,Virginia,16000,Cherry Hill,3.40765,...,"(38.600900000000003, -77.268199999999993)",Landslide,Landslide,Small,Rain,,,,NBC 4 news,http://www.nbc4.com/news/11186871/detail.html
1,42,2007-03-22,,,United States,US,Ohio,17288,New Philadelphia,3.33522,...,"(40.517499999999998, -81.430499999999995)",Landslide,Landslide,Small,Rain,,,,Canton Rep.com,http://www.cantonrep.com/index.php?ID=345054&C...
2,56,2007-04-06,,,United States,US,Pennsylvania,15930,Wilkinsburg,2.91977,...,"(40.4377, -79.915999999999997)",Landslide,Landslide,Small,Rain,,,,The Pittsburgh Channel.com,https://web.archive.org/web/20080423132842/htt...
3,59,2007-04-14,,,Canada,CA,Quebec,42786,Châteauguay,2.98682,...,"(45.322600000000001, -73.777100000000004)",Landslide,Riverbank collapse,Small,Rain,,,,Le Soleil,http://www.hebdos.net/lsc/edition162007/articl...
4,61,2007-04-15,,,United States,US,Kentucky,6903,Pikeville,5.66542,...,"(37.432499999999997, -82.493099999999998)",Landslide,Landslide,Small,Downpour,,,0.0,Matthew Crawford (KGS),
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,1367,2009-12-18,,,United States,US,Mississippi,2175,Purvis,17.40130,...,"(31.134499999999999, -89.227599999999995)",Landslide,Mudslide,Small,Downpour,,,0.0,,http://www.clarionledger.com/article/20091219/...
1294,6253,2007-06-01,,,United States,US,Colorado,1864,Granby,5.53226,...,"(40.0929, -105.87520000000001)",Landslide,Landslide,Medium,Unknown,,0.0,0.0,Sky-Hi News,http://www.skyhidailynews.com/news/13393638-11...
1302,6303,2009-06-11,,,United States,US,Utah,48174,Logan,1.79637,...,"(41.738500000000002, -111.81319999999999)",Landslide,Landslide,Small,Flooding,,0.0,3.0,Salt Lake Tribune,http://www.sltrib.com/news/1739780-155/logan-c...
1345,6585,2008-05-11,5:45,,United States,US,Maryland,19096,Camp Springs,1.87540,...,"(38.816200000000002, -76.921599999999998)",Landslide,Other,Medium,Continuous rain,,0.0,0.0,Hazard Mitigation Plan,http://www.princegeorgescountymd.gov/sites/Sus...
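
The same kind of range query can be written with `Series.between`. The sketch below uses a made-up four-row table and assumes a pandas version that accepts string values for `inclusive` (1.3 or later).

```python
import pandas as pd

# Made-up rows that straddle both ends of the date range
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "date": pd.to_datetime(["2006-12-31", "2007-03-02", "2009-12-18", "2010-01-01"]),
})

# inclusive="neither" excludes both endpoints, matching the strict > and < above
in_range = toy[toy["date"].between("2007-01-01", "2010-01-01", inclusive="neither")]
print(in_range)
```

By default `between` includes both endpoints, so the `inclusive` argument matters whenever a row falls exactly on a boundary date.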


## Basic String Processing 

In terms of string processing, we can start off with very simple string methods. Python has its own standard
string processing functions, but since we want concise operations that work on a whole column at once, we'll use
the pandas versions as much as possible. The first one we'll use is `pd.Series.replace("a", "b")`. An example of how we
use this function is in the next cell:

In [0]:
landslides["state/province"] = landslides["state/province"].replace("Virginia", "VA")
landslides.head()

Ignoring the final `.head()`, we can see that this method goes through every 
row in our column and replaces each value equal to `a` with `b`. So for the code above,
all instances of `Virginia` were replaced with `VA` in case we find it easier to work with state/province codes instead.

We can do the same thing for every state in terms of converting from state names to region code,
but it would be extremely annoying and tedious to call a replace statement for every single state. A far more
efficient way to replace states with their codes would be to use a dictionary instead:

In [None]:
landslides["state/province"] = landslides["state/province"].replace({
    "Virginia": "VA",
    "Ohio": "OH",
    "Pennsylvania": "PA",
    "Kentucky": "KY",
    "Quebec": "QC"
})
landslides["state/province"].head(5)

0    VA
1    OH
2    PA
3    QC
4    KY
Name: state/province, dtype: object
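
Note that `Series.replace` only swaps a value when the whole cell matches, whereas `Series.str.replace` rewrites a substring wherever it occurs. The small sketch below, on made-up values, shows the difference.

```python
import pandas as pd

# Made-up values; "West Virginia" contains "Virginia" as a substring
places = pd.Series(["Virginia", "West Virginia", "Ohio"])

# Series.replace swaps a value only when the whole cell equals "Virginia"
whole_cell = places.replace("Virginia", "VA")

# Series.str.replace rewrites the substring wherever it occurs
substring = places.str.replace("Virginia", "VA", regex=False)

print(whole_cell.tolist())
print(substring.tolist())
```

For mapping state names to codes, whole-cell matching is usually what we want, since it leaves "West Virginia" untouched.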

Another use of the `.replace` method is to further clean up string columns. In the column `landslides["source_link"]`,
there are many web URLs that start with `https` rather than `http`, which could potentially slow down or break a 
web scraping module, for example.

In [None]:
landslides["source_link"].head(5)

0        http://www.nbc4.com/news/11186871/detail.html
1    http://www.cantonrep.com/index.php?ID=345054&C...
2    https://web.archive.org/web/20080423132842/htt...
3    http://www.hebdos.net/lsc/edition162007/articl...
4                                              Missing
Name: source_link, dtype: object

The `.replace` method can help us in this situation as well. By setting the `regex` parameter to `True`, we can replace matching substrings
with the new string rather than requiring the whole cell to match the string we put in (naive exact matching). 

`regex` is short for regular expressions, a pattern-matching syntax for strings, but don't worry about its syntax or uses too much as it is outside the scope of this module.

In [0]:
landslides["source_link"].replace("https", "http", regex = True).head(5)
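
One caveat with a plain `"https"` pattern is that it matches anywhere in the string, not only at the start. A slightly more careful sketch anchors the pattern to the beginning of each URL with `^`; the example URLs below are made up.

```python
import pandas as pd

# Made-up links; the second one contains "https" later in its path
links = pd.Series([
    "https://example.com/a",
    "http://example.com/https-guide",
])

# "^" anchors the pattern to the start of the string, so only the scheme changes
anchored = links.str.replace(r"^https://", "http://", regex=True)
print(anchored.tolist())
```

The second link is left unchanged because its `https` is not at the start of the string.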

Next up is `pd.Series.str.contains("a")`, which determines whether or not each cell contains the pattern `a` and returns `True`
if it does. An example of this method in action is to determine whether or not each `landslides["source_name"]`
entry originates from the Red Cross, as shown below:

In [None]:
landslides[landslides["source_name"].str.contains("Red Cross")]

Unnamed: 0,id,date,time,continent_code,country_name,country_code,state/province,population,city/town,distance,...,hazard_type,landslide_type,landslide_size,trigger,storm_name,injuries,fatalities,source_name,source_link,month_name
8,105,2007-06-27,Missing,SA,Ecuador,EC,Zamora-Chinchipe,15276,Zamora,0.47714,...,Landslide,Landslide,Medium,Downpour,Missing,0.0,0.0,Red Cross - Field reports,https://www-secure.ifrc.org/dmis/prepare/view_...,June
9,106,2007-06-27,Missing,SA,Ecuador,EC,Loja,117796,Loja,0.35649,...,Landslide,Landslide,Medium,Downpour,Missing,0.0,0.0,Red Cross - Field reports,https://www-secure.ifrc.org/dmis/prepare/view_...,June
10,107,2007-06-27,Missing,SA,Ecuador,EC,Pichincha,5114,Sangolquí,33.94603,...,Landslide,Landslide,Medium,Downpour,Missing,0.0,0.0,Red Cross - Field reports,https://www-secure.ifrc.org/dmis/prepare/view_...,June


From the table displayed above, we can see that this method is good for selecting only the rows that contain a
certain string.
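
`str.contains` also takes optional arguments that help with messy text. The sketch below, on made-up source names, uses `case=False` to ignore capitalisation and `na=False` so that missing cells count as "no match" instead of producing `NaN` in the mask.

```python
import numpy as np
import pandas as pd

# Made-up source names with mixed capitalisation and one missing value
sources = pd.Series(
    ["Red Cross - Field reports", "red cross bulletin", "NBC 4 news", np.nan]
)

# case=False ignores capitalisation; na=False treats missing cells as no match
mask = sources.str.contains("red cross", case=False, na=False)
print(sources[mask])
```

Without `na=False`, a column that still contains missing values would yield a mask with `NaN` entries, which cannot be used directly for boolean indexing.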

The third pandas string method is `pd.Series.str.get`, which simply returns the character at position `i` of each string, counting from zero. For
example, let's say that for some obscure reason we want the fourth letter (index 3) of every country name; we can easily do that
by calling the following:

In [None]:
landslides["country_name"].str.get(3)

0       t
1       t
2       t
3       a
4       t
       ..
1688    t
1689    t
1690    t
1691    t
1692    t
Name: country_name, Length: 1691, dtype: object

We can achieve the same results using ordinary indexing on each string, but this method applies to the whole column in a single call. 

Last but not least is `pd.Series.str.slice`, which is similar to `.get` except we can take _ranges_ of a string.
Similar to how we can get the fourth letter of a country name, we can also take the _first three_ letters of a country
name and treat them as our country abbreviation by running the following code:

In [None]:
landslides["country_name"].str.slice(0, 3)

0       Uni
1       Uni
2       Uni
3       Can
4       Uni
       ... 
1688    Uni
1689    Uni
1690    Uni
1691    Uni
1692    Uni
Name: country_name, Length: 1691, dtype: object
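
Both `.str.get` and `.str.slice` also have a shorthand: the `.str` accessor supports square-bracket indexing and slicing. A short sketch on two made-up entries:

```python
import pandas as pd

countries = pd.Series(["United States", "Canada"])

fourth_char = countries.str[3]   # same result as countries.str.get(3)
first_three = countries.str[:3]  # same result as countries.str.slice(0, 3)

print(fourth_char.tolist(), first_three.tolist())
```

Which form to use is mostly a matter of taste; the bracket form reads like ordinary Python string indexing.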

Data science is not all about using statistical methods to discover patterns in the data. In fact, 
most data analysis projects allocate most of their time to pre-processing and cleaning the data. 

Before beginning your analysis, always make sure that the data is clean and in the format you want, to eliminate the possibility of bugs and 
erroneous conclusions.

In the next module, we'll be going over a very useful tool for getting to know your data better and discovering some of the patterns in it: data visualization. See you there!