<a href="https://pandas.pydata.org/">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/2560px-Pandas_logo.svg.png" width="300px">
</a>


# Pandas and Data Wrangling



## Objectives
You can work through this notebook during the lecture on Pandas and data wrangling to get hands-on experience as the material is taught, or run it later at your own pace. 
The aim is to make you familiar with DataFrames and with how to use Pandas to clean, analyse and visualise them.

After completing this notebook you will be able to:
*   Understand the Pandas library and its features.
*   Import your data
*   Understand your data, its size, features and structure.
*   Handle missing values
*   Correct the data format
*   Do statistical analysis, standardise and normalise data
*   Visualise data using the Matplotlib library


<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ul>
    <li><a href="#Import_Libraries_and_the_Data_Set">Import Libraries and the Data Set</a></li> 
    <li><a href="#Discover_and_understand_the_Data_Set">Discover and understand the data set</a></li> 
    <li><a href="#Identify_missing_Data">Identify and handle missing values</a></li>     
    <li><a href="#deal_missing_values">Deal with missing values</a></li>
    <li><a href="#correct_data_format">Correct data format</a></li>
    <li><a href="#data_standardization">Data standardization</a></li>
    <li><a href="#data_normalization">Data normalization (centering/scaling)</a></li>
    
</ul>

</div>

<hr>


## Tabular Data analysis with Pandas Library
The Pandas library is widely used to discover, clean and structure data.


> _Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language._

From https://pandas.pydata.org/.

<h2>What is the purpose of data wrangling?</h2>


Data wrangling is the process of cleaning a raw data set and converting it into a format that is suitable for analysis. It includes data discovery, finding and deleting or replacing missing values, changing the data format into a usable form, and data visualisation.
Data cleaning removes erroneous data from the data set; data wrangling is broader and involves more steps and processes.  

Data wrangling is the first and an essential part of data analysis, but it is often also the most time-consuming and tedious part. It is important to understand the process and the code, because you will usually need them before any analysis can start. 


### Data Wrangling Steps
The basic steps include:

* **Discovering**: 
Understand the data and find out what information is useful for your problem.
* **Structuring**:
Standardise the data format for disparate types of data and make the data usable for automated or semi-automated analysis. The data must be structured to fit the analytics model.

* **Cleaning**: 
Most datasets contain outliers that could alter the outcome of an analysis, so structured data must be cleaned. This involves changing null values, eliminating redundancies and standardising formatting to improve data consistency.


* **Enriching**:
You are pretty familiar with the data at this point. This is the moment to ask yourself whether you want to enhance the data. Are you looking to add other data to it?


* **Validating**: 
This step involves iterative programming steps that verify the quality and safety of your data. For example, you may have problems if your data is not clean or enriched, or if the attributes are not distributed evenly.


## Visualise Data 
Visualisation libraries such as Matplotlib and seaborn are often used to plot data and analyse it statistically.
Plots make it easier to discover and clean outliers, which are data points that do not make sense and can distort the analysis.

### ---------------------------------------------------------------


<h2 id="Import_Libraries_and_the_Data_Set"> 1) Import Libraries and the Data Set </h2>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


###  Reading the dataset:
You can load the data from a URL or directly from your computer.


### From URL
First, we assign the URL of the dataset to "filename", as in the next code example. This step does not apply here, because the file used in this notebook is read from disk.

In [2]:
#filename = "https://....the website/data.csv"

### Reading the Dataset from a file in your computer 
If the file is in the same folder as your code, the simplest choice is a relative path such as path="./data.csv".
If it is in a different folder, use the path to that data folder instead.

In [3]:

#NOTE: Change the path to match yours
#my_path = "/Users/yourName /TeamCoders_Event_Based_Model/2_DataCleaningAndWrangling/who_suicide_statistics_modified3.csv"
my_path = "./data/who_suicide_statistics_modified3.csv" # Relative path: "./" is the directory that the Jupyter notebook runs in, which shortens the absolute path above
data_raw = pd.read_csv(my_path, header=0) # header=0: the first line of the file holds the column names


 <h2 id="Discover_and_understand_the_Data_Set"> 2) Discover and understand the data set. </h2>
Display the data and read it. Discover the data by displaying it using the following methods:

- **.head()** 
- **.tail()**

Investigate data with :
- **.info()**
- **.describe()**

Find the shape (rows and columns) and the data type of each column using the following:
- **.shape**
- **.dtypes**
- the function **len(data)**
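
As a minimal sketch of these inspection tools, the toy frame below shows what each one returns. The frame `toy` and its values are made up for illustration and are not taken from the WHO file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics a few columns of the notebook's dataset.
toy = pd.DataFrame({
    "country": ["Albania", "Albania", "Austria", "Austria"],
    "year": [1987, 1987, 2003, 2003],
    "suicides_no": ["21", "Null", np.nan, "7"],   # strings mixed with a missing value
})

n_rows, n_cols = toy.shape      # tuple of (rows, columns)
print(toy.head(2))              # first two rows
print(toy.tail(1))              # last row
print(toy.dtypes)               # one dtype per column
print(len(toy))                 # number of rows, same as toy.shape[0]
toy.info()                      # column names, non-null counts and dtypes
```

Note that **.shape** and **.dtypes** are attributes, so they are written without parentheses, while **.head()**, **.tail()**, **.info()** and **.describe()** are methods.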


Use the method <b>head()</b> to display the first five rows of the dataframe.


In [4]:
len(data_raw)

27840

In [5]:
data_raw.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,HDI for year,gdp_for_year ($)
0,Albania,1987,male,15-24 years,21,312900,,2156624900
1,Albania,1987,male,35-54 years,16,308000,,2156624900
2,Albania,1987,female,15-24 years,14,289700,,2156624900
3,Albania,1987,male,75+ years,1,21800,,2156624900
4,Albania,1987,male,25-34 years,9,274300,,2156624900


In [6]:
data_raw.shape

(27840, 8)

In [7]:
data_raw.dtypes

country                object
year                    int64
sex                    object
age                    object
suicides_no            object
population              int64
HDI for year          float64
 gdp_for_year ($)      object
dtype: object

## Exercise 1)
Use the tail method to see the last part of the data.
How many lines do you see?
Use the tail method again to see the last 20 lines of the data.

In [8]:
data_raw.tail(20) 

Unnamed: 0,country,year,sex,age,suicides_no,population,HDI for year,gdp_for_year ($)
27820,Republic of Korea,1992,male,15-24 years,48.0,4456500,,350051111253
27821,Bulgaria,2003,male,5-14 years,6.0,407671,,20982685981
27822,South Africa,2009,female,75+ years,3.0,502919,,297216730669
27823,Canada,2006,female,25-34 years,118.0,2177957,,1315415197461
27824,Austria,2003,female,5-14 years,,457342,,261695778781
27825,Saint Vincent and Grenadines,2009,female,5-14 years,,9944,,674922481
27826,Azerbaijan,2003,female,35-54 years,12.0,1137200,,7276013032
27827,Suriname,2010,male,15-24 years,18.0,48112,0.707,4368398048
27828,Seychelles,1987,female,25-34 years,,4800,,249267040
27829,Puerto Rico,1992,female,75+ years,2.0,82500,,34630430000


## Exercise 2)
Use the info method to gather information  about the data. 


In [9]:
# Write your code here
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27840 entries, 0 to 27839
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   country             27840 non-null  object 
 1   year                27840 non-null  int64  
 2   sex                 27840 non-null  object 
 3   age                 27840 non-null  object 
 4   suicides_no         23575 non-null  object 
 5   population          27840 non-null  int64  
 6   HDI for year        8368 non-null   float64
 7    gdp_for_year ($)   27840 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 1.7+ MB


## Exercise 3)
Use the len() function to find the length of your data. 


In [10]:
# Write your code here
len(data_raw)

27840

### Observations
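* The data frame has 27840 rows and 8 columns that describe the demographics of suicides worldwide.
* Because the file was read with header=0, the first line of the file supplies the column names shown above. If a file is read without a header, the columns are only numbered and the names sit in row 0.
* In that case the header is in the first row (index 0) and we need to move that row and make it the header (see 2.1).
* The data frame is an object itself, and several of its columns, including suicides_no and gdp_for_year ($), have the generic object dtype. 
* This is not yet a data structure that is compatible with the analysis we will follow.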

## Exercise 4 )
Write any other observation that you see about the data. 

What does NaN mean? Is this missing data?


## **2.1) Fix the headers (column names).** 
If the data frame has no column names assigned, or they are in the wrong place, we can restore the headers in the following three ways:


### **A) How to replace the header with the first row.**
When the descriptions of the columns sit in the first row (0), we need to replace the header with that row. We need to:
- 1) locate the first row and assign it as columns
- 2) select the dataframe from the 2nd row downwards
- 3) reset the index of the dataframe 

* The first row of the dataframe is assigned to df.columns using the **df.iloc[0]** statement.
* Next, the dataframe is sliced from the second row, which has index 1, using **.iloc[1:]**. 
* On the same line we reset the row index using the **reset_index()** method.
* With these steps, the header of the dataframe is replaced with the first row of the dataframe.
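
The three steps can be tried on a small example. This is only a sketch: the CSV text below is invented, and reading it with `header=None` is an assumption used to reproduce a frame whose names are stuck in row 0.

```python
import io
import pandas as pd

# Hypothetical three-line CSV; the first line holds the column names.
csv_text = "country,year,sex\nAlbania,1987,male\nAlbania,1987,female\n"

# header=None keeps the name line as data row 0, which reproduces the problem.
raw = pd.read_csv(io.StringIO(csv_text), header=None)

fixed = raw.copy()
fixed.columns = fixed.iloc[0]                    # 1) first row becomes the header
fixed = fixed.iloc[1:].reset_index(drop=True)    # 2) drop that row, 3) renumber rows from 0
fixed.columns.name = None                        # clear the leftover label on the column index

# Reading with header=0 gives the same column names directly.
direct = pd.read_csv(io.StringIO(csv_text), header=0)
```

After the promotion every column still holds strings (for example `"1987"`), because the name row forced each column to be read as text. Reading the file with `header=0` avoids this, which is one reason to prefer it when the names are in the first line.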


### A) Method to add the headers that exist at the row (0) 

In [11]:
# data_raw.columns = data_raw.iloc[0]
# data_raw = data_raw.iloc[1:].reset_index(drop=True) #This method will  reset the index of the rows 

# data_raw.head()

We recommend running only one method when you first run the notebook.
Here the file was already read with header=0, so the column names are in place and the code for methods (A), (B) and (C) is commented out (a # in front of each line). To run one of them, remove the #. 

### B ) Method 
We create headers from the descriptions that are in the first line (row 0) and then delete row 0.
The following code is commented out (it has a # in front), so it does not run as code. To run it, skip the code of method A and delete the # in front of each line.

We create a Python list **headers** containing the names of the headers.


In [12]:
##headers = ["country","year","gender","age","suicides_no","population","HDI for year","gdp_for_year ($)"]

In [13]:
# data_raw.columns = headers            # 1) Add Headers as columns 
# data_raw= data_raw.drop(0, axis =0)   # 2) delete the row with the labels , which is row 0 
#data_raw.head()                        # 3) check the five lines of the Data Frame

###  C) Method

This method only works when the headers are in the first line of the file (line 0). If there are no headers at all, use method B to insert them manually.
Use the Pandas function read_csv('filename.csv') to load the data from the file; by default it takes the column names from the first line. 


In [14]:
# df = pd.read_csv(my_path)

The method **.head()** displays **the first five rows** of the dataframe by default; when you pass a number as an argument, it displays that many rows.


In [15]:
# To see what the data set looks like, we'll use the head() method for the first 12 lines.
data_raw.head(12)


Unnamed: 0,country,year,sex,age,suicides_no,population,HDI for year,gdp_for_year ($)
0,Albania,1987,male,15-24 years,21.0,312900,,2156624900
1,Albania,1987,male,35-54 years,16.0,308000,,2156624900
2,Albania,1987,female,15-24 years,14.0,289700,,2156624900
3,Albania,1987,male,75+ years,1.0,21800,,2156624900
4,Albania,1987,male,25-34 years,9.0,274300,,2156624900
5,Albania,1987,female,75+ years,1.0,35600,,2156624900
6,Albania,1987,female,35-54 years,6.0,278800,,2156624900
7,Albania,1987,female,25-34 years,4.0,257200,,2156624900
8,Albania,1987,male,55-74 years,1.0,137500,,2156624900
9,Albania,1987,female,5-14 years,,311000,,2156624900


## **2.2) Understand the data set- Select columns- get values**
To select a column, write 
**dataframe['column_name']**
Let's show all the countries in the data set.

In [17]:
data_raw_countries=data_raw['country']
print(data_raw_countries)

0            Albania
1            Albania
2            Albania
3            Albania
4            Albania
            ...     
27835        Belgium
27836       Thailand
27837    Netherlands
27838        Grenada
27839         Mexico
Name: country, Length: 27840, dtype: object
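
The bracket syntax has two common forms, sketched below on a made-up frame (the values are placeholders, not rows from the dataset).

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Albania", "Austria", "Mexico"],
    "year": [1987, 2003, 2010],
    "population": [312900, 457342, 1000000],
})

one_column = toy["country"]              # a single name returns a pandas Series
two_columns = toy[["country", "year"]]   # a list of names returns a DataFrame
values = toy["country"].to_numpy()       # the underlying values as a NumPy array
```

A Series is one-dimensional and keeps the row index, which is why the printout above shows the index on the left and the column name at the bottom.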


### **How we can count values within the column**
Use the **.value_counts()** method.

In [18]:
data_raw['country'].value_counts() # select the column with data_raw['country'], then .value_counts() counts how many rows each country has

Netherlands               383
Austria                   383
Mauritius                 382
Iceland                   382
Puerto Rico               373
                         ... 
Bosnia and Herzegovina     24
Macau                      12
Dominica                   12
Cabo Verde                 12
Mongolia                   10
Name: country, Length: 101, dtype: int64

In [19]:
print((data_raw_countries).value_counts())

Netherlands               383
Austria                   383
Mauritius                 382
Iceland                   382
Puerto Rico               373
                         ... 
Bosnia and Herzegovina     24
Macau                      12
Dominica                   12
Cabo Verde                 12
Mongolia                   10
Name: country, Length: 101, dtype: int64
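
To make clear what is being counted, here is a small sketch on an invented Series. Each distinct value becomes an index label and the number next to it is how many rows hold that value.

```python
import pandas as pd

countries = pd.Series(["Austria", "Albania", "Austria", "Mexico", "Austria", "Albania"])

counts = countries.value_counts()                 # rows per country, most frequent first
shares = countries.value_counts(normalize=True)   # the same counts as fractions of all rows

print(counts)
```

In the notebook's data each row is one country, year, sex and age group combination, so the count per country is the number of such combinations recorded for it, not a number of people.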


### **How to make selections in columns and make lists** 
To find every value that appears at least once in a column we use the **.unique()** method. It returns the values in the order in which they first appear, which in the output below is alphabetical.
We use it here to list all the countries and to cast them to the string data type.  


In [20]:
data_raw_countries_alphabetically = data_raw['country'].unique().astype(str)
data_raw_countries_alphabetically

array(['Albania', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
       'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
       'Barbados', 'Belarus', 'Belgium', 'Belize',
       'Bosnia and Herzegovina', 'Brazil', 'Bulgaria', 'Cabo Verde',
       'Canada', 'Chile', 'Colombia', 'Costa Rica', 'Croatia', 'Cuba',
       'Cyprus', 'Czech Republic', 'Denmark', 'Dominica', 'Ecuador',
       'El Salvador', 'Estonia', 'Fiji', 'Finland', 'France', 'Georgia',
       'Germany', 'Greece', 'Grenada', 'Guatemala', 'Guyana', 'Hungary',
       'Iceland', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan',
       'Kazakhstan', 'Kiribati', 'Kuwait', 'Kyrgyzstan', 'Latvia',
       'Lithuania', 'Luxembourg', 'Macau', 'Maldives', 'Malta',
       'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Netherlands',
       'New Zealand', 'Nicaragua', 'Norway', 'Oman', 'Panama', 'Paraguay',
       'Philippines', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar',
       'Republic of Korea', 'Romania', '

In [21]:
type(data_raw_countries_alphabetically)

numpy.ndarray

data_raw_countries_alphabetically is a NumPy array; we convert it to a list using the **.tolist()** method.


In [22]:
data_countries_list= data_raw_countries_alphabetically.tolist() # Convert a numpy array to a list
data_countries_list
#print(f'The list of countries is:{data_countries_list}')

['Albania',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Bosnia and Herzegovina',
 'Brazil',
 'Bulgaria',
 'Cabo Verde',
 'Canada',
 'Chile',
 'Colombia',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Dominica',
 'Ecuador',
 'El Salvador',
 'Estonia',
 'Fiji',
 'Finland',
 'France',
 'Georgia',
 'Germany',
 'Greece',
 'Grenada',
 'Guatemala',
 'Guyana',
 'Hungary',
 'Iceland',
 'Ireland',
 'Israel',
 'Italy',
 'Jamaica',
 'Japan',
 'Kazakhstan',
 'Kiribati',
 'Kuwait',
 'Kyrgyzstan',
 'Latvia',
 'Lithuania',
 'Luxembourg',
 'Macau',
 'Maldives',
 'Malta',
 'Mauritius',
 'Mexico',
 'Mongolia',
 'Montenegro',
 'Netherlands',
 'New Zealand',
 'Nicaragua',
 'Norway',
 'Oman',
 'Panama',
 'Paraguay',
 'Philippines',
 'Poland',
 'Portugal',
 'Puerto Rico',
 'Qatar',
 'Republic of Korea',
 'Romania',
 'Russian Federation',
 'Saint 

In [23]:
#Find the length of the list
len(data_countries_list)

101
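
The sketch below, on an invented Series, shows that **.unique()** keeps the order of first appearance, so an explicit sort is needed when alphabetical order matters. It also shows `nunique()`, which returns the number of distinct values directly.

```python
import pandas as pd

countries = pd.Series(["Mexico", "Albania", "Mexico", "Austria", "Albania"])

in_order_seen = list(countries.unique())      # order of first appearance
alphabetical = sorted(countries.unique())     # explicit alphabetical sort
n_countries = countries.nunique()             # number of distinct values
```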

### Data columns Investigation 
### Find the mean of a numerical column
The method to find the mean of a column is **.mean()**

In [25]:
data_raw_suicides = data_raw['suicides_no']
data_raw_suicides

0         21
1         16
2         14
3          1
4          9
        ... 
27835      6
27836    152
27837     21
27838    NaN
27839      7
Name: suicides_no, Length: 27840, dtype: object

In [26]:
data_raw_suicides_mean = data_raw['suicides_no'].mean()

TypeError: can only concatenate str (not "int") to str

### ERROR !
 We observe that:
There are string values in the suicides_no column, so we cannot find the mean with the .mean() method, because it applies only to numeric values.

Let's display all the unique values in the column 'suicides_no' with the method **.unique()** and try to change the type of these values to integers using typecasting with **.astype(int)**.
 

In [104]:
data_raw_suicides = data_raw['suicides_no'].unique().astype(int) # Convert the data into the suicides_no column into an integer
data_raw_suicides

ValueError: cannot convert float NaN to integer

### Error and Observations:
   * There are many NaN (Not a Number) data points in the suicides_no column. They indicate missing values and cannot be converted into integers.
   * As the display of data_raw above shows, the HDI column is dominated by NaN. We will delete the column with the Human Development Index because we will not use it for the analysis and it has too many missing values to repair. This also saves space and time. 
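
One possible way around both errors is sketched below on an invented column. It assumes that entries which cannot be parsed should be treated as missing, which is a choice rather than a rule. `pd.to_numeric` with `errors="coerce"` turns such entries into NaN, and the result is a float column because NaN has no integer representation.

```python
import numpy as np
import pandas as pd

# Hypothetical column with the same kinds of entries as suicides_no.
raw = pd.Series(["21", "16", np.nan, "Null", "Unknown", "7"], dtype=object)

# errors="coerce" turns anything that cannot be parsed into NaN.
numeric = pd.to_numeric(raw, errors="coerce")

mean_value = numeric.mean()              # NaN entries are skipped by default
n_missing = int(numeric.isna().sum())    # how many entries ended up missing
```

The mean is then computed over the three parsable numbers only, so it is worth checking `n_missing` before trusting it.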
   


# 3) **How to work with missing data?**
As we can see, there are several NaN (not a number) entries in the data frame; missing values can also appear as the word Null or as other words and string values. Such missing values may hinder further analysis. This section shows how to identify them and how to work with missing data in order to bring the DataFrame into a format that is ready for analysis.

Steps for working with missing data:

<ol>
    <li>Identify missing data</li>
    <li>Deal with missing data</li>
    <li>Correct data format</li>
</ol>





<h2 id="Identify_missing_Data">3.1 Identify missing Data</h2>


### **3.1.1 Find missing data**
Empty fields are read in as NaN by default. We use the following methods to identify these missing values:

- **.isnull()**
- **.notnull()**

The output has the same shape as the input and holds a Boolean for every cell, indicating whether that value is missing.



When we use the **isnull()** method the output is True where the value is missing and False where there is a real value. The opposite happens when we use the .notnull() method.

In [27]:
(data_raw.isnull()).value_counts() # True marks a missing value; value_counts() then counts how many rows share each True/False pattern

country  year   sex    age    suicides_no  population  HDI for year   gdp_for_year ($) 
False    False  False  False  False        False       True          False                 16357
                                                       False         False                  7218
                              True         False       True          False                  3115
                                                       False         False                  1150
dtype: int64

There are null values in just two columns: suicides_no and HDI for year.

We observe the opposite with **.notnull()**. An equivalent is to put a tilde **(~)** in front of the data set followed by .isnull(); the tilde sign indicates negation.  

In [28]:
(data_raw.notnull()).value_counts() # True marks a present value and False a missing one; the counts are again per row pattern

country  year  sex   age   suicides_no  population  HDI for year   gdp_for_year ($) 
True     True  True  True  True         True        False         True                  16357
                                                    True          True                   7218
                           False        True        False         True                   3115
                                                    True          True                   1150
dtype: int64

In [29]:
(~data_raw.isnull()).value_counts() # The ~ indicates negation, so ~isnull() gives the same result as notnull()

country  year  sex   age   suicides_no  population  HDI for year   gdp_for_year ($) 
True     True  True  True  True         True        False         True                  16357
                                                    True          True                   7218
                           False        True        False         True                   3115
                                                    True          True                   1150
dtype: int64

#### We get the same counts with inverted Booleans by using the **.notnull()** method.

**(~data_raw.isnull()).value_counts()** # The tilde symbol indicates negation, so in front of isnull() it gives the same result as notnull()

**(data_raw.notnull()).value_counts()** # False marks the missing values; the counts are per True/False row pattern
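
The equivalence can be checked on a small invented frame. The column names below are placeholders.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "suicides_no": [21.0, np.nan, 14.0],
    "hdi": [np.nan, np.nan, 0.707],
})

is_missing = toy.isnull()     # True where a value is missing
is_present = toy.notnull()    # True where a value is present

# ~ flips every Boolean, so the two frames agree cell by cell.
same = (~is_missing).equals(is_present)
```

Which form to use is a matter of readability: **.notnull()** says directly what is meant, while **~** is handy when a Boolean mask already exists and only needs to be inverted.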


In [30]:
missing_data = data_raw.isnull() # find data that are null and return 'True' if is null and 'False' if it is not
missing_data.head(20)

Unnamed: 0,country,year,sex,age,suicides_no,population,HDI for year,gdp_for_year ($)
0,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,True,False
5,False,False,False,False,False,False,True,False
6,False,False,False,False,False,False,True,False
7,False,False,False,False,False,False,True,False
8,False,False,False,False,False,False,True,False
9,False,False,False,False,True,False,True,False



### **3.1.2.Count the missing values in the data frame**.

### Count missing values in each column

Using a for loop in Python, we can quickly find the number of missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset. In the body of the for loop the method **.value_counts()** counts how many "True" and how many "False" entries each column has. 




In [32]:
missing_data=(data_raw.isnull())

In [33]:
type(missing_data)

pandas.core.frame.DataFrame

In [34]:
# The for loop 
for column in data_raw.columns: # Select each column-name in the header and prints it 
    print(column)
    print(missing_data[column].value_counts())
    print('----')

country
False    27840
Name: country, dtype: int64
----
year
False    27840
Name: year, dtype: int64
----
sex
False    27840
Name: sex, dtype: int64
----
age
False    27840
Name: age, dtype: int64
----
suicides_no
False    23575
True      4265
Name: suicides_no, dtype: int64
----
population
False    27840
Name: population, dtype: int64
----
HDI for year
True     19472
False     8368
Name: HDI for year, dtype: int64
----
 gdp_for_year ($) 
False    27840
Name:  gdp_for_year ($) , dtype: int64
----


In [None]:
# The commented code below gives the same result as the loop above; it only builds the list of column names explicitly.
# for column in missing_data.columns.values.tolist():  # take the DataFrame missing_data, select its column names and make a list of them
#     print(column)
#     print(missing_data[column].value_counts())
#     print("")

A simpler check is to display the Boolean data set itself.

In [35]:
missing_data = data_raw.isnull() # find data that are null and return 'True' if is null and 'False' if it is not
missing_data.head(20)

Unnamed: 0,country,year,sex,age,suicides_no,population,HDI for year,gdp_for_year ($)
0,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,True,False
5,False,False,False,False,False,False,True,False
6,False,False,False,False,False,False,True,False
7,False,False,False,False,False,False,True,False
8,False,False,False,False,False,False,True,False
9,False,False,False,False,True,False,True,False


As discussed, the only columns that contain missing data are suicides_no and HDI for year.

In [36]:
print(data_raw.isnull().sum()) # Count the null values in each column of the data frame

country                   0
year                      0
sex                       0
age                       0
suicides_no            4265
population                0
HDI for year          19472
 gdp_for_year ($)         0
dtype: int64
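
Counts are easier to judge next to the share of rows they represent. The sketch below uses an invented frame; the idea is that the mean of a Boolean column equals the fraction of True values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["Albania", "Austria", "Mexico", "Grenada"],
    "suicides_no": [21.0, np.nan, 7.0, np.nan],
    "hdi": [np.nan, np.nan, np.nan, 0.707],
})

missing_count = toy.isnull().sum()            # True counts as 1, so sum() counts missing cells
missing_share = toy.isnull().mean() * 100     # mean of Booleans = fraction missing, here in percent

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary)
```

Applied to the notebook's data, the same two lines would show at a glance that HDI for year is missing in most rows, which supports the decision to drop it.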


The data frame has 27840 rows and the column suicides_no has 23575 non-null entries, so 27840 - 23575 = 4265 entries are missing. They are the ones marked True (is null) in the count. The column has the object dtype, however, which suggests that some of the non-null entries are strings rather than numbers. We need to investigate this.

The HDI column has a lot of missing data: 19472 values are missing. We will not use it for the analysis, so it is best to delete this column.

## **3.2 Investigation for alphanumeric characters or words misplaced as values**
There are two ways to identify the non-integer values in the column suicides_no:
1) Using the regular expressions module, which we import as re. 
A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

2) Flag the non-numeric data with the method **.str.isdigit()** compared to False; this collects all the data points that are strings not made up of digits only.

### 1st Way Regular expressions module **re**

In [37]:
# Import the regular expressions module as re.
# The re module provides an interface to the regular expression engine,
# allowing you to compile REs into objects and then perform matches with them.
import re 
replace = re.compile("([a-zA-Z]+)")  # pattern with one capture group: a run of letters

data_raw['string'] = data_raw['suicides_no'].str.extract(replace) # adds a column 'string' holding the letters found in each entry (NaN if none)
data_raw['integer'] = data_raw['suicides_no'].str.replace(replace, " ", regex=True) # adds a column 'integer' with the letters replaced by a space; regex=True is explicit because the default changed to False in pandas 2.0

data_raw['string'].unique() # shows the unique "words" in the string column

array([nan, 'Null', 'Unknown'], dtype=object)
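
The same idea on a small invented Series, as a sketch. It passes the pattern as a plain string and uses `expand=False` so that `str.extract` returns a Series rather than a one-column DataFrame.

```python
import numpy as np
import pandas as pd

values = pd.Series(["21", "Null", np.nan, "Unknown", "7"], dtype=object)

# One capture group: the first run of letters in each entry, NaN where there is none.
words = values.str.extract(r"([a-zA-Z]+)", expand=False)

found = list(words.dropna().unique())   # distinct words, without the NaN placeholder
```

Entries made up of digits only, and entries that are already missing, both end up as NaN in `words`, which is why nan appears in the output above alongside the two words.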

 In the column suicides_no there are the strings **'Null'** and **'Unknown'** as well as **nan (not a number)** entries, and we need to replace them.

### 2nd Way: apply the **.isdigit()** method to the strings.


The following code flags every entry in the column whose string is not made up of digits only (isdigit is False). The method **unique()** then shows which strings these are.

In [38]:
non_numeric=data_raw['suicides_no'].str.isdigit() == False

In [39]:
type(non_numeric) # non_numeric is a 1D Boolean pandas Series

pandas.core.series.Series

In [40]:
data_raw['suicides_no'][non_numeric].unique()

array(['Null', 'Unknown'], dtype=object)
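
A short sketch on an invented Series shows why the missing entries do not appear in this result. For an object column, `.str.isdigit()` returns NaN for a missing entry, and comparing NaN with False gives False.

```python
import numpy as np
import pandas as pd

values = pd.Series(["21", "Null", np.nan, "Unknown", "7"], dtype=object)

digit_flags = values.str.isdigit()      # True or False for strings, NaN for the missing entry
non_numeric = digit_flags == False      # NaN == False is False, so the missing entry drops out

strings_only = values[non_numeric].tolist()
```

The mask therefore selects only real strings that contain something other than digits. Note that it would also flag strings such as "3.5" or "-4", because a decimal point or a sign is not a digit.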

We observe that the column contains the strings "Null" and "Unknown", and we need to remove them.

The NaN values do not show up here, because **.str.isdigit()** returns NaN for a missing entry and comparing NaN with False gives False. To see them as well, we apply **pd.to_numeric**.

The isdigit approach is an elegant way to find strings and words in the data set without regular expressions, but it does not show the NaN data points. The following code includes them; it is more complete and shorter than the regular expressions (re) version.

In [41]:
suicides =data_raw['suicides_no']

In [42]:
suicides2 = pd.to_numeric(suicides, errors='coerce') # vectorised call; entries that cannot be parsed become NaN

In [43]:
print(data_raw['suicides_no'][suicides2.isna()].unique())# Shows the data which are not numeric

[nan 'Null' 'Unknown']


The above shows that the column contains NaN (not a number) entries as well as the words Null and Unknown.

In [44]:
data_raw.head(15)

Unnamed: 0,country,year,sex,age,suicides_no,population,HDI for year,gdp_for_year ($),string,integer
0,Albania,1987,male,15-24 years,21.0,312900,,2156624900,,21.0
1,Albania,1987,male,35-54 years,16.0,308000,,2156624900,,16.0
2,Albania,1987,female,15-24 years,14.0,289700,,2156624900,,14.0
3,Albania,1987,male,75+ years,1.0,21800,,2156624900,,1.0
4,Albania,1987,male,25-34 years,9.0,274300,,2156624900,,9.0
5,Albania,1987,female,75+ years,1.0,35600,,2156624900,,1.0
6,Albania,1987,female,35-54 years,6.0,278800,,2156624900,,6.0
7,Albania,1987,female,25-34 years,4.0,257200,,2156624900,,4.0
8,Albania,1987,male,55-74 years,1.0,137500,,2156624900,,1.0
9,Albania,1987,female,5-14 years,,311000,,2156624900,,


#### In the column suicides_no there are the strings 'Null' and 'Unknown' as well as NaN entries, and we need to replace them.

<h2 id="deal_missing_values">4 Deal with missing data</h2>
<b>How to deal with missing data?</b>

<ol>
    <li>Drop data<br>
        a. Drop the whole row<br>
        b. Drop the whole column
    </li>
    <li>Replace data<br>
        a. Replace it by mean<br>
        b. Replace it by frequency<br>
        c. Replace it based on other functions
    </li>
</ol>
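
The options in this list map onto a few pandas calls. The sketch below uses an invented frame; the threshold of three non-missing values and the choice of mean and most frequent value are illustrative assumptions, not recommendations for the WHO data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["Albania", "Austria", "Mexico", "Grenada"],
    "sex": ["male", np.nan, "female", "female"],
    "suicides_no": [20.0, np.nan, 10.0, np.nan],
})

# 1a. Drop every row that lacks a value in suicides_no.
dropped_rows = toy.dropna(subset=["suicides_no"])

# 1b. Drop every column that has fewer than 3 non-missing values.
dropped_cols = toy.dropna(axis=1, thresh=3)

# 2a. Replace missing numbers by the column mean (NaN is ignored when averaging).
filled_mean = toy["suicides_no"].fillna(toy["suicides_no"].mean())

# 2b. Replace missing categories by the most frequent value.
filled_mode = toy["sex"].fillna(toy["sex"].mode()[0])
```

Dropping loses information and replacing invents it, so the right choice depends on how much data is missing and on what the column will be used for.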



### 4.1  Drop / Delete a column with lots of missing data.
To remove columns we use the .drop() method as **df.drop(["column1", "column2", ...], axis=1)**. The argument axis=1 indicates that the method applies to columns.
 * I will drop the column HDI for year. It has a lot of missing values and I do not think that it contributes to the result. The helper columns string and integer created in section 3.2 are dropped as well.
 
 * I will name the new DataFrame **data_frame** because it is no longer the raw data.
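
A minimal sketch of the call on an invented frame follows. The column `helper` is a hypothetical scratch column standing in for the helper columns mentioned above.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Albania", "Austria"],
    "HDI for year": [None, 0.8],
    "helper": ["a", "b"],      # hypothetical scratch column
})

# Each column name is its own quoted string inside the list.
trimmed = toy.drop(["HDI for year", "helper"], axis=1)

# Equivalent spelling that avoids the axis argument.
trimmed_alt = toy.drop(columns=["HDI for year", "helper"])
```

**.drop()** returns a new DataFrame and leaves the original untouched, which is why the result has to be assigned to a name.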


In [47]:

data_frame= data_raw.drop(['HDI for year','string','integer'], axis = 1)                         
data_frame

Unnamed: 0,country,year,sex,age,suicides_no,population,gdp_for_year ($)
0,Albania,1987,male,15-24 years,21,312900,2156624900
1,Albania,1987,male,35-54 years,16,308000,2156624900
2,Albania,1987,female,15-24 years,14,289700,2156624900
3,Albania,1987,male,75+ years,1,21800,2156624900
4,Albania,1987,male,25-34 years,9,274300,2156624900
...,...,...,...,...,...,...,...
27835,Belgium,2011,female,25-34 years,6,707535,527008453887
27836,Thailand,2016,male,75+ years,152,1124052,411755164833
27837,Netherlands,1998,female,15-24 years,21,934500,432476116419
27838,Grenada,2002,female,5-14 years,,11760,540336926


So far we have found 4265 missing values in the suicides_no column, which appear as NaN or as the strings 'Unknown' and 'Null'.

We have deleted the HDI for year column because it had a lot of missing values.

We have created a new data_frame with the data, which is no longer the raw data.

###  4.2 REPLACE :
#### Locate the data in suicides_no that appeared as strings and REPLACE them

In [49]:
df_null=data_frame.loc[data_frame['suicides_no']=='Null']
df_null

Unnamed: 0,country,year,sex,age,suicides_no,population,gdp_for_year ($)
378,Antigua and Barbuda,1994,male,15-24 years,Null,5976,589429593
388,Antigua and Barbuda,1995,female,55-74 years,Null,3808,577280741
401,Antigua and Barbuda,1998,female,75+ years,Null,1476,727860593
558,Antigua and Barbuda,2013,male,15-24 years,Null,8061,1192925407
1360,Aruba,2006,female,55-74 years,Null,8553,2421474860
1384,Aruba,2008,female,5-14 years,Null,7269,2791960894
1389,Aruba,2008,male,75+ years,Null,1333,2791960894
1421,Aruba,2011,female,5-14 years,Null,7064,2584463687
10569,Grenada,2005,female,5-14 years,Null,10711,695370296
10578,Grenada,2006,female,5-14 years,Null,10384,698700667


In [50]:
df_null.shape

(12, 7)

The shape shows that 12 rows contain the string 'Null' in suicides_no. Overall there are 4265 data points in suicides_no with wrong or missing values, which is exactly what we calculated with the missing_values variable.

In [51]:
#### Count the values of the 'Null' in 'suicides_no' column 
#df_null["suicides_no"].value_counts()  # Same as:  df_null.suicides_no.value_counts()  
df_null.suicides_no.value_counts()  

### If we want to be concise we could write it elegantly in one line
# data_frame.loc[data_frame['suicides_no']=='Null'].suicides_no.value_counts()

Null    12
Name: suicides_no, dtype: int64

In [52]:
df_unknown=data_frame.loc[data_frame['suicides_no'] =='Unknown'] 
df_unknown  ## show the lines with the word Unknown in the suicides_no column 

Unnamed: 0,country,year,sex,age,suicides_no,population,gdp_for_year ($)
10610,Grenada,2009,female,15-24 years,Unknown,11815,771278111
10615,Grenada,2009,female,75+ years,Unknown,2227,771278111
10622,Grenada,2010,female,15-24 years,Unknown,11637,771015889
10629,Grenada,2010,male,25-34 years,Unknown,9006,771015889
10639,Grenada,2011,female,55-74 years,Unknown,5474,778648667
10660,Grenada,2013,female,35-54 years,Unknown,10858,842620111
10678,Grenada,2014,male,35-54 years,Unknown,11369,911481481
10691,Grenada,2015,male,5-14 years,Unknown,9409,997007926


In [53]:
df_unknown.suicides_no.value_counts() ## count the values with the word Unknown in the suicides_no column 

Unknown    8
Name: suicides_no, dtype: int64

## REPLACE  incorrect / missing values!
There are three kinds of entries in the column 'suicides_no' of our Data Frame that need to be cleaned: the word 'Null', the word 'Unknown' and NaN (Not a Number). I will replace the word 'Null' with 0, because it actually stands for zero.

The strategy for other strings that don't contribute to the data (unless it is categorical data) is to replace them with NaN. I replace " " with NaN (Not a Number), which is pandas' default missing value marker, for reasons of computational speed and convenience. Then we can delete all the NaN data points.
- Replace all such strings with NaN
- Delete all NaN values

Here are the functions we use to locate one value and replace it with another:

* .loc[data['Column Name']=='Character to locate']
to locate the characters in the named column
* .replace(A, B, inplace = True) 
to replace value A with value B.
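
The locate-and-replace steps above can be sketched on a small made-up frame. The values below are illustrative only and the names `toy`, `null_rows` and `cleaned` are hypothetical. The sketch passes a dictionary to `.replace()` and assigns the result to a new variable instead of using `inplace=True`, then uses `pd.to_numeric` so that the remaining number-like strings become numbers:

```python
import numpy as np
import pandas as pd

# Toy column for illustration only (not the real dataset)
toy = pd.DataFrame({"suicides_no": ["21", "Null", "Unknown", " ", "7"]})

# Locate the rows that hold a given string
null_rows = toy.loc[toy["suicides_no"] == "Null"]

# Replace several markers in one call: 'Null' becomes 0, the others become NaN
cleaned = toy.replace({"Null": 0, "Unknown": np.nan, " ": np.nan})

# Turn the remaining numeric strings into numbers (NaN stays NaN)
cleaned["suicides_no"] = pd.to_numeric(cleaned["suicides_no"])

print(cleaned)
```

Because NaN is a floating point marker, the converted column has a float type as long as any NaN remains in it.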



In [124]:
# replace an empty space " " and "Unknown" with NaN, and "Null" with 0, using np.nan from the NumPy library
data_frame.replace(" ", np.nan, inplace = True)
data_frame.replace("Null", 0, inplace = True)
data_frame.replace("Unknown", np.nan, inplace = True)
data_frame.head(15)

Unnamed: 0,country,year,sex,age,suicides_no,population,gdp_for_year ($)
0,Albania,1987,male,15-24 years,21.0,312900,2156624900
1,Albania,1987,male,35-54 years,16.0,308000,2156624900
2,Albania,1987,female,15-24 years,14.0,289700,2156624900
3,Albania,1987,male,75+ years,1.0,21800,2156624900
4,Albania,1987,male,25-34 years,9.0,274300,2156624900
5,Albania,1987,female,75+ years,1.0,35600,2156624900
6,Albania,1987,female,35-54 years,6.0,278800,2156624900
7,Albania,1987,female,25-34 years,4.0,257200,2156624900
8,Albania,1987,male,55-74 years,1.0,137500,2156624900
9,Albania,1987,female,5-14 years,,311000,2156624900


### Let's now count how many values are NaN

In [None]:
# NaN is not a string, so comparing with 'NaN' would find nothing; .isna() flags the missing values instead
#data_frame['suicides_no'].isna().sum()

Based on the summary above, each column has 27840 rows of data and only one column, suicides_no, contains missing data:

<ol>
    <li>"suicides_no": 4273 missing data which are replaced with NaN</li>
    
</ol>
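
Counting NaN needs a little care, because NaN is a missing-value marker rather than a string and a comparison with the text 'NaN' finds nothing. A minimal sketch on a made-up column (illustrative values, hypothetical variable names) shows the difference:

```python
import numpy as np
import pandas as pd

# Toy column for illustration only (not the real dataset)
toy = pd.DataFrame({"suicides_no": ["21", np.nan, "Unknown", np.nan, "Null"]})

# Comparing with the string 'NaN' matches no rows
matches_string = (toy["suicides_no"] == "NaN").sum()

# .isna() marks real missing values; .sum() counts the True entries
missing_count = toy["suicides_no"].isna().sum()

print(matches_string, missing_count)
```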


### DELETE/DROP  NaN 
To deal with NaN we first need to see where it appears. We can inspect the data frame by calling .head() or .tail().

The way to delete the NaN values is to apply *.dropna()* to the data. Immediately afterwards you need to reset the index of the data frame, because the deleted rows leave gaps in it.
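
As a small sketch of this step, assuming a made-up four-row frame (the names `toy`, `dropped` and `renumbered` are hypothetical), `.dropna()` keeps the original row labels and `.reset_index(drop=True)` renumbers them:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "suicides_no": [5.0, np.nan, 3.0, np.nan],
})

# dropna() returns a new frame; the surviving rows keep labels 0 and 2
dropped = toy.dropna()

# drop=True discards the old labels instead of storing them in an 'index' column
renumbered = dropped.reset_index(drop=True)

print(list(dropped.index), list(renumbered.index))
```

Neither call changes `toy` itself; the result has to be assigned to a variable to be kept.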


In [126]:
data_frame.tail(10).reset_index()  # use .reset_index()  to visualise it and count it better

Unnamed: 0,index,country,year,sex,age,suicides_no,population,gdp_for_year ($)
0,27830,Kuwait,2007,male,55-74 years,2.0,99708,114641097818
1,27831,Australia,1995,male,35-54 years,587.0,2507400,367216364716
2,27832,Colombia,1989,male,35-54 years,152.0,2802188,39540080200
3,27833,Argentina,1988,female,5-14 years,11.0,3115000,126206817196
4,27834,Ukraine,2005,female,25-34 years,182.0,3380536,86142018069
5,27835,Belgium,2011,female,25-34 years,6.0,707535,527008453887
6,27836,Thailand,2016,male,75+ years,152.0,1124052,411755164833
7,27837,Netherlands,1998,female,15-24 years,21.0,934500,432476116419
8,27838,Grenada,2002,female,5-14 years,,11760,540336926
9,27839,Mexico,1988,female,75+ years,7.0,614000,183144164357


### Remove the NaN values 
We can see that some entries, such as row 27838 above (and likewise 27824, 27825 and 27828), have NaN values in the suicides_no column. We use **.dropna()** to remove them. Always reset the index afterwards.

In [128]:
# Now I will drop any data points with NaN since I don't know what to replace them with
data_frame = data_frame.dropna() # deletes the rows with NaN
data_frame.tail(20).reset_index()  ## resets the index for display only; the result is not assigned back

Unnamed: 0,index,country,year,sex,age,suicides_no,population,gdp_for_year ($)
0,27816,Uzbekistan,2014,female,75+ years,9,348465,63067077179
1,27817,Uzbekistan,2014,male,5-14 years,6,2762158,63067077179
2,27818,Uzbekistan,2014,female,5-14 years,44,2631600,63067077179
3,27819,Uzbekistan,2014,female,55-74 years,21,1438935,63067077179
4,27820,Republic of Korea,1992,male,15-24 years,48,4456500,350051111253
5,27821,Bulgaria,2003,male,5-14 years,6,407671,20982685981
6,27822,South Africa,2009,female,75+ years,3,502919,297216730669
7,27823,Canada,2006,female,25-34 years,118,2177957,1315415197461
8,27826,Azerbaijan,2003,female,35-54 years,12,1137200,7276013032
9,27827,Suriname,2010,male,15-24 years,18,48112,4368398048


In [129]:
data_frame

Unnamed: 0,country,year,sex,age,suicides_no,population,gdp_for_year ($)
0,Albania,1987,male,15-24 years,21,312900,2156624900
1,Albania,1987,male,35-54 years,16,308000,2156624900
2,Albania,1987,female,15-24 years,14,289700,2156624900
3,Albania,1987,male,75+ years,1,21800,2156624900
4,Albania,1987,male,25-34 years,9,274300,2156624900
...,...,...,...,...,...,...,...
27834,Ukraine,2005,female,25-34 years,182,3380536,86142018069
27835,Belgium,2011,female,25-34 years,6,707535,527008453887
27836,Thailand,2016,male,75+ years,152,1124052,411755164833
27837,Netherlands,1998,female,15-24 years,21,934500,432476116419


Whole columns or rows should be dropped only if most entries in the column are empty. In our dataset, none of the remaining columns are empty.


There is some freedom in choosing which method to use to replace data; however, some methods may seem more reasonable than others. We will apply each method according to the data and what the column represents:

**Replace by mean:** depending on the context, the missing data could be replaced with the mean of the column.
   

#### Calculate the mean value for the "suicides_no" column
Replacing by the mean does not make sense for this data, but I show how to substitute with the mean to demonstrate the method. 
If you want to use it elsewhere you can remove the # in front of the code.


In [None]:
# How to calculate and show the average of a column
avg_suicide_no=data_frame['suicides_no'].astype('int').mean(axis=0)  # First we cast the values to integers (type casting)
print("Average suicides:",avg_suicide_no )

Average suicides: 217.6898205117325


In [None]:
print ("Average suicides integer:", int(avg_suicide_no))

Average suicides integer: 217


Replace "NaN" with the integer average value in the "suicide_no" column.
As written above we will not replace by a mean in this example.
I just show the code to use it when appropriate.



In [None]:
#data_frame['suicides_no'] = data_frame['suicides_no'].fillna(int(avg_suicide_no))
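
To show the method on data where it is harmless, here is a minimal sketch with a made-up Series (the values and the names `toy`, `mean_value` and `filled` are hypothetical). It relies on `.mean()` skipping NaN by default and on `.fillna()` returning a new Series:

```python
import numpy as np
import pandas as pd

# Toy Series for illustration only
toy = pd.Series([10.0, np.nan, 20.0, np.nan, 30.0], name="suicides_no")

# mean() ignores NaN by default, so only 10, 20 and 30 are averaged
mean_value = toy.mean()

# fillna() returns a new Series in which every NaN is replaced by the mean
filled = toy.fillna(mean_value)

print(mean_value)
print(filled.tolist())
```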

### Replace by Frequency:
Find the most frequent value (the mode) of the column and replace the NaN values with it.

In [130]:
data_frame['suicides_no'].value_counts()

1       1876
2       1344
3       1001
4        820
5        616
        ... 
1243       1
2114       1
7921       1
7436       1
2872       1
Name: suicides_no, Length: 1576, dtype: int64

There are 1576 different numbers in the column.  We can also use the ".idxmax()" method to calculate the most common value automatically.

In [131]:
data_frame['suicides_no'].value_counts().idxmax()

'1'

We can see that number **'1' is the most common value**.


The replacement procedure is very similar to what we have seen previously:


In [132]:
#replace the missing 'NaN' values in the suicides_no column  by the most frequent number which is 1 
data_frame['suicides_no'].replace(np.nan, 1 , inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_frame['suicides_no'].replace(np.nan, 1 , inplace=True)


<b>Good!</b> Now, we have a dataset with no missing values.
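
The message "A value is trying to be set on a copy of a slice from a DataFrame" shown above is a warning rather than an error. One common way to avoid it, sketched here on made-up data with hypothetical names, is to take an explicit `.copy()` of a filtered frame and to assign the result back to the column instead of calling `inplace=True` on a selected column:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "suicides_no": [1.0, np.nan, 4.0],
})

# .copy() makes the filtered frame an independent object
subset = toy[toy["country"] != "C"].copy()

# Assign the filled column back rather than modifying it in place
subset["suicides_no"] = subset["suicides_no"].fillna(1)

print(subset)
```

The original `toy` frame is left untouched, which makes it easier to see which object each step changes.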


<h2 id="correct_data_format"> 5 Correct data format </h2>

We are almost there!
The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other)

In Pandas, we use:

**.dtypes** to check the data types

**.astype()**  to change the data type


<h4>Let's list the data types for each column</h4>


As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, the 'year', 'suicides_no', 'population' and 'gdp_for_year ($)' variables hold numerical values, so we should expect them to be of the type 'float' or 'int'; however, they are shown as type 'object'. The 'age' column holds ranges such as '15-24 years', so it stays as 'object'. We have to convert data types into a proper format for each column using the **.astype()** method.
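
A minimal sketch of checking and converting types, assuming a made-up frame whose numbers were read in as text (the frame `toy` is hypothetical):

```python
import pandas as pd

# Toy frame for illustration only: the numbers are stored as strings
toy = pd.DataFrame({
    "year": ["1987", "1988"],
    "population": ["312900", "308000"],
})

# .dtypes is an attribute (no parentheses) listing the type of each column
print(toy.dtypes)

# .astype() accepts a dictionary that maps column names to target types
toy = toy.astype({"year": "int64", "population": "int64"})

print(toy.dtypes)
```

After the conversion, arithmetic such as `toy["population"].sum()` adds numbers instead of joining strings.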

<h4>Convert data types to proper format</h4>


In [133]:
data_frame[["year", "suicides_no", "population"]] = data_frame[["year", "suicides_no", "population"]].astype("int")
#Type casting using .astype("float")   .astype("str")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_frame[["year", "suicides_no", "population"]] = data_frame[["year", "suicides_no", "population"]].astype("int")


## The problems of the column 'gdp_for_year ($)'
* The header of the column contains spaces and a dollar symbol, ' gdp_for_year ($) ', and we need to replace it with a coherent name such as 'gdp_for_year_usd'
* The numbers in the column should be written without the commas.

### 1st Change the header

In [134]:
headers= data_frame.columns # Select the headers

In [135]:
## The column name ' gdp_for_year ($) ' contains spaces, so I replace it with a name without spaces, 'gdp_for_year_usd'
## map() applies the lambda to every header and list() collects the results
headers = list(map(lambda x: x.replace(' gdp_for_year ($) ', 'gdp_for_year_usd'), headers)) # only the matching header is changed
headers

['country',
 'year',
 'sex',
 'age',
 'suicides_no',
 'population',
 'gdp_for_year_usd']

In [136]:
data_frame.columns=headers
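
As an alternative to rebuilding the whole header list, pandas also offers `DataFrame.rename`. The sketch below uses a made-up one-row frame (hypothetical name `toy`) with the same awkward column name:

```python
import pandas as pd

# Toy frame for illustration only, with the header that contains spaces
toy = pd.DataFrame({
    "country": ["Albania"],
    " gdp_for_year ($) ": ["2,156,624,900"],
})

# rename() maps old names to new ones and returns a new frame
renamed = toy.rename(columns={" gdp_for_year ($) ": "gdp_for_year_usd"})

print(list(renamed.columns))
```

Columns that are not mentioned in the dictionary keep their names.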

### 2nd Change the format 

In [137]:
data_frame.dtypes

country             object
year                 int64
sex                 object
age                 object
suicides_no          int64
population           int64
gdp_for_year_usd    object
dtype: object

In [139]:
data_frame['gdp_for_year_usd'] # Show the values, which are of the type object

0          2,156,624,900
1          2,156,624,900
2          2,156,624,900
3          2,156,624,900
4          2,156,624,900
              ...       
27834     86,142,018,069
27835    527,008,453,887
27836    411,755,164,833
27837    432,476,116,419
27839    183,144,164,357
Name: gdp_for_year_usd, Length: 23567, dtype: object

We need to REMOVE the COMMAS ',' from the numbers and typecast them to integers.

In [141]:
# First replace the comma ',' with an empty string ''. Then typecast the column to integers
data_frame['gdp_for_year_usd']= data_frame['gdp_for_year_usd'].str.replace(',', '').astype(int)
data_frame['gdp_for_year_usd']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_frame['gdp_for_year_usd']= data_frame['gdp_for_year_usd'].str.replace(',', '').astype(int)


0          2156624900
1          2156624900
2          2156624900
3          2156624900
4          2156624900
             ...     
27834     86142018069
27835    527008453887
27836    411755164833
27837    432476116419
27839    183144164357
Name: gdp_for_year_usd, Length: 23567, dtype: int64
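
If the thousands separator is known before loading, `pandas.read_csv` has a `thousands` parameter that handles it while reading. The sketch below is an assumption-laden illustration: it uses a small in-memory file with a semicolon delimiter, which is not the format of the file used in this notebook.

```python
import io

import pandas as pd

# Made-up file content for illustration only (semicolon-separated)
csv_text = (
    "country;gdp_for_year_usd\n"
    "Albania;2,156,624,900\n"
    "Belgium;527,008,453,887\n"
)

# thousands="," tells the parser to ignore commas inside numbers
toy = pd.read_csv(io.StringIO(csv_text), sep=";", thousands=",")

print(toy.dtypes)
```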

In [142]:
data_frame['gdp_for_year_usd'].astype("int")

0          2156624900
1          2156624900
2          2156624900
3          2156624900
4          2156624900
             ...     
27834     86142018069
27835    527008453887
27836    411755164833
27837    432476116419
27839    183144164357
Name: gdp_for_year_usd, Length: 23567, dtype: int64

<h4>Let us list the columns after the conversion</h4>


In [143]:
data_frame.dtypes

country             object
year                 int64
sex                 object
age                 object
suicides_no          int64
population           int64
gdp_for_year_usd     int64
dtype: object

In [144]:
data_frame = data_frame.astype({"year" : int,"suicides_no" : int, "population" : int}) # A dictionary sets the type of several columns in one call

In [145]:
data_frame.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,gdp_for_year_usd
0,Albania,1987,male,15-24 years,21,312900,2156624900
1,Albania,1987,male,35-54 years,16,308000,2156624900
2,Albania,1987,female,15-24 years,14,289700,2156624900
3,Albania,1987,male,75+ years,1,21800,2156624900
4,Albania,1987,male,25-34 years,9,274300,2156624900


### **Wonderful!**
#### Save the data

Now we have finally obtained the cleaned dataset with no missing values with all data in its proper format.
We can save this data set as it is now the 'clean' data set which is ready for processing. 


In [146]:
data_frame.to_csv(r'./data/who_suicide_statistics_clean_data.csv', index=False)

# **Data Operations**
Data operations can be performed through various built-in methods for faster data processing and analysis. A few methods are:
- Operations using Grouping
- Operations with Sorting
- Operations with Statistics 
- Operations with Functions (Standardisation and Normalisation)




##  Grouping
 Use **.groupby('...')** to group the data. It involves a combination of **splitting** the object, **applying** a function and **combining** the results: **.groupby('column name')**

To select the data of a single country from the 101, I'll group the data frame by 'country' and from that I will select the country of preference by using **.get_group('selection')**.
Let's select the data for the United Kingdom.
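
The split, apply and combine idea can be sketched on a made-up frame with two countries (the values and the names `toy`, `grouped`, `group_a` and `totals` are hypothetical):

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2000, 2000, 2001],
    "suicides_no": [1, 2, 3, 4],
})

# Split: one group per distinct country
grouped = toy.groupby("country")

# Select the rows of a single group
group_a = grouped.get_group("A")

# Apply and combine: sum one column within each group
totals = grouped["suicides_no"].sum()

print(group_a)
print(totals)
```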

In [147]:
country_grouped= data_frame.groupby('country')  # a single column name, so get_group takes a plain string
UK_data=country_grouped.get_group('United Kingdom')
UK_data

Unnamed: 0,country,year,sex,age,suicides_no,population,gdp_for_year_usd
26476,United Kingdom,1985,male,75+ years,264,1202838,489285164271
26477,United Kingdom,1985,male,55-74 years,915,5170113,489285164271
26478,United Kingdom,1985,male,35-54 years,128,6899879,489285164271
26479,United Kingdom,1985,male,25-34 years,62,3969689,489285164271
26480,United Kingdom,1985,female,55-74 years,678,6002096,489285164271
...,...,...,...,...,...,...,...
26843,United Kingdom,2015,female,25-34 years,181,4414464,2885570309161
26844,United Kingdom,2015,female,75+ years,18,3070457,2885570309161
26845,United Kingdom,2015,female,15-24 years,14,3966564,2885570309161
26846,United Kingdom,2015,female,5-14 years,6,3663221,2885570309161


We make another data frame by selecting the columns that we want to analyse 

In [148]:
UK_data_year_suicides= UK_data[['country','year', 'suicides_no']] # country_grouped.get_group('United Kingdom').
UK_data_year_suicides

Unnamed: 0,country,year,suicides_no
26476,United Kingdom,1985,264
26477,United Kingdom,1985,915
26478,United Kingdom,1985,128
26479,United Kingdom,1985,62
26480,United Kingdom,1985,678
...,...,...,...
26843,United Kingdom,2015,181
26844,United Kingdom,2015,18
26845,United Kingdom,2015,14
26846,United Kingdom,2015,6


To find the total suicides of each year, group two columns together and use the  method **.sum()**


In [149]:
UK_suicides_per_year =UK_data_year_suicides.groupby(['country','year'],as_index=False).sum()
UK_suicides_per_year.head(10)

Unnamed: 0,country,year,suicides_no
0,United Kingdom,1985,3017
1,United Kingdom,1986,4587
2,United Kingdom,1987,3604
3,United Kingdom,1988,4683
4,United Kingdom,1989,3731
5,United Kingdom,1990,3851
6,United Kingdom,1991,3827
7,United Kingdom,1992,4133
8,United Kingdom,1993,4012
9,United Kingdom,1994,4380


## Sorting values 
To sort values either by size or in alphabetical order we use the method **.sort_values()**

In [150]:
UK_suicides_per_year_sorted=UK_suicides_per_year.sort_values('suicides_no') # The default is ascending order
UK_suicides_per_year_sorted

Unnamed: 0,country,year,suicides_no
17,United Kingdom,2002,2668
13,United Kingdom,1998,2967
12,United Kingdom,1997,2991
0,United Kingdom,1985,3017
22,United Kingdom,2007,3348
26,United Kingdom,2011,3421
19,United Kingdom,2004,3508
18,United Kingdom,2003,3529
2,United Kingdom,1987,3604
30,United Kingdom,2015,3650
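
To sort from largest to smallest instead, `.sort_values()` takes `ascending=False`. A short sketch using three of the yearly totals shown earlier (the frame name `toy` is hypothetical):

```python
import pandas as pd

# Three yearly totals copied from the table above, for illustration
toy = pd.DataFrame({
    "year": [1985, 1986, 1987],
    "suicides_no": [3017, 4587, 3604],
})

# Largest value first; the original row labels travel with the rows
descending = toy.sort_values("suicides_no", ascending=False)

print(descending)
```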


## Operations with Statistics 
The methods **.max()**, **.min()**, **.mean()** and **.std()** are used to find the maximum, minimum, mean and standard deviation of the data. The **.describe()** method shows all of these statistics at once.

In [151]:
UK_suicides_per_year_sorted

Unnamed: 0,country,year,suicides_no
17,United Kingdom,2002,2668
13,United Kingdom,1998,2967
12,United Kingdom,1997,2991
0,United Kingdom,1985,3017
22,United Kingdom,2007,3348
26,United Kingdom,2011,3421
19,United Kingdom,2004,3508
18,United Kingdom,2003,3529
2,United Kingdom,1987,3604
30,United Kingdom,2015,3650


In [152]:
print(UK_suicides_per_year['suicides_no'].max())
print(UK_suicides_per_year['suicides_no'].min())
print(UK_suicides_per_year['suicides_no'].std())

4683
2668
452.7454838187561
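
The `.describe()` method collects these statistics in one call. A minimal sketch on the first five yearly totals from the table above (the names `yearly` and `summary` are hypothetical):

```python
import pandas as pd

# First five yearly totals from the table above, for illustration
yearly = pd.Series([3017, 4587, 3604, 4683, 3731], name="suicides_no")

# describe() returns count, mean, std, min, quartiles and max as a Series
summary = yearly.describe()

print(summary)
```

Individual entries can be read by label, for example `summary["max"]`.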


## Operations with Statistical Functions
### Example Statistical Functions of **Normalization** and **Standardization** 
Data is usually collected from different sources in different formats.
Normalization and standardization are the processes of transforming data into a common format, allowing the researcher to make meaningful comparisons.


### **Data Normalization** 
Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so its values range from 0 to 1.

### **Example**

To demonstrate normalization, let's say we want to scale the column "suicides_no" in the UK_suicides_per_year data frame.
We would like to normalize this variable so that its values lie between 0 and 1.
We replace the original value by (original value)/(maximum value).

**Step 1)** Create a function of normalization 

In [153]:
def normalize(number):
    return (number)/number.max()

**Step 2)** Apply the function to the selected column 

In [154]:
normalize(UK_suicides_per_year['suicides_no'])


0     0.644245
1     0.979500
2     0.769592
3     1.000000
4     0.796712
5     0.822336
6     0.817211
7     0.882554
8     0.856716
9     0.935298
10    0.806107
11    0.785608
12    0.638693
13    0.633568
14    0.834508
15    0.802691
16    0.862268
17    0.569720
18    0.753577
19    0.749092
20    0.787316
21    0.822550
22    0.714926
23    0.790305
24    0.906684
25    0.790519
26    0.730515
27    0.856716
28    0.837924
29    0.887892
30    0.779415
Name: suicides_no, dtype: float64
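
Dividing by the maximum puts the largest value at 1, but the smallest value stays above 0 (0.569720 in the output above). A different technique, min-max scaling, stretches the values over the full range from 0 to 1. The sketch below is an alternative to the function used in this notebook; the function name `min_max_scale` is hypothetical:

```python
import pandas as pd


def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    return (values - values.min()) / (values.max() - values.min())


# First five yearly totals from the table above, for illustration
yearly = pd.Series([3017, 4587, 3604, 4683, 3731])

scaled = min_max_scale(yearly)

print(scaled)
```

This formula divides by zero if all values are equal, so it assumes the column contains at least two different values.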

### **Data Standardization**
Data standardization is also a term for a particular type of data normalization where we subtract the mean and divide by the standard deviation. The data is transformed into a common format that allows the researcher to make meaningful comparisons.

In [155]:
def standarize_test(number):
    return (number-number.mean())/number.std()

In [156]:
standarize_test(UK_suicides_per_year['suicides_no'])

0    -1.625994
1     1.841738
2    -0.329459
3     2.053778
4    -0.048949
5     0.216101
6     0.163091
7     0.838967
8     0.571709
9     1.384528
10    0.048236
11   -0.163803
12   -1.683421
13   -1.736431
14    0.341999
15    0.012896
16    0.629137
17   -2.396846
18   -0.495115
19   -0.541499
20   -0.146134
21    0.218310
22   -0.894899
23   -0.115211
24    1.088556
25   -0.113002
26   -0.733660
27    0.571709
28    0.377339
29    0.894186
30   -0.227857
Name: suicides_no, dtype: float64
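
A quick way to check a standardization is to confirm that the result has a mean of about 0 and a standard deviation of about 1. The sketch below does this on five of the yearly totals (hypothetical names `yearly` and `z`). It also shows that pandas `.std()` computes the sample standard deviation (`ddof=1`) by default, whereas `numpy.std` uses `ddof=0`, so the two give slightly different numbers:

```python
import numpy as np
import pandas as pd

# First five yearly totals from the table above, for illustration
yearly = pd.Series([3017.0, 4587.0, 3604.0, 4683.0, 3731.0])

# Subtract the mean and divide by the (sample) standard deviation
z = (yearly - yearly.mean()) / yearly.std()

print(z.mean(), z.std())

# pandas uses ddof=1 by default, NumPy uses ddof=0
print(yearly.std(), np.std(yearly.to_numpy()))
```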

## **Visualise the data**

In [161]:
%matplotlib 
import matplotlib.pyplot as plt
#import seaborn as sns

plt.figure(figsize=(14,6))
data_frame.groupby('country').sum().sort_values(by='suicides_no',ascending=False)[['suicides_no']][:101].plot(kind='bar',figsize=(16,8),title='Sum of suicides by country during 1985-2016')
plt.ylabel("Sum of suicides")
plt.xlabel("Country")

Using matplotlib backend: MacOSX


Text(0.5, 0, 'Country')

In [162]:
# Here it plots the top 20 countries in suicides 
plt.figure(figsize=(14,6))
data_frame.groupby('country').sum().sort_values(by='suicides_no',ascending=False)[['suicides_no']][:20].plot(kind='bar',figsize=(16,8),title='Sum of suicides by country during 30 years 1985-2016')
plt.ylabel("Sum of suicides")


Text(0, 0.5, 'Sum of suicides')



## Exercise 
Calculate the GDP/capita of the countries and visualise the cases according to GDP/capita.



# Convert to GDP/capita

### Thank you for completing this lab!

## Author
##  Mary Tziraki


                                                                                     

## <h3 align="center"> © Health+Bioscience IDEAS 2022. All rights reserved. </h3>
