## Pandas
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

To organise data into tables and do calculations on those tables, you will use the
pandas module. A module is a package of various pieces of code that can be used individually. The pandas module provides
extensive and advanced data analysis capabilities that complement Python. This course only scratches the surface of pandas.

Before using a module, I have to tell the computer that I’m going to use it, which is done with an import statement. The short name pd is the conventional alias for pandas.

In [None]:
import pandas as pd

### Importance of Pandas

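- Pandas provides essential data structures, mainly Series (a single column of values) and DataFrames (tables), which help in manipulating data sets and time series. Older versions also offered a Panel structure, which has since been removed from the library.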

- It is free to use and open source, which has helped make it one of the most widely used data science libraries in the world.

- Pandas can perform a wide range of tasks, from computing statistics such as the mean, median and mode of a dataset to loading large CSV files and manipulating their contents. In short, to master data science you must be skilful in pandas.
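
As a minimal sketch of the two main structures and the statistics mentioned in the list above, the cell below builds a Series and a DataFrame by hand. The values and the column names `item` and `count` are made up for illustration and are not taken from the course data.

```python
import pandas as pd

# Made-up values, for illustration only
values = pd.Series([4, 8, 8, 15, 20])

print(values.mean())    # arithmetic mean
print(values.median())  # middle value of the sorted data
print(values.mode())    # most frequent value(s), returned as a Series

# A dataframe is a table; each of its columns is a Series
table = pd.DataFrame({
    'item': ['a', 'b', 'c'],
    'count': [3, 5, 7],
})
print(table)
```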

`From now on, we will use real-life data to learn and practise as we progress through the course. The data file can be found in the folder that contains this notebook.`

`Data analysis often starts with a question or with some data.` Here, the questions are about deaths due to tuberculosis (TB):

- What is the total, smallest, largest, and average number of deaths due to TB?
- What is the death rate (number of deaths divided by population) of each country?
- Which countries have the smallest and largest number of deaths?
- Which countries have the smallest and largest death rate?

**Loading the data**

Many applications can read files in Excel format, and pandas can too. Asking the
computer to read the data looks like this:

In [None]:
data = pd.read_excel('WHO POP TB some.xls')
data

The variable name data is not descriptive, but as there is only one dataset in our analysis,
there is no possible confusion with other data, and short names help to keep the lines of
code short.

The function read_excel() takes a file name as an argument and returns the table contained in the file. In pandas, tables are called dataframes.

To load the data, I simply call the function and store the returned dataframe in a variable. A file name must be given as a string, a piece of text surrounded by quotes.

The quote marks tell Python that this isn’t a variable, function or module name. They also mark the text as a single name, even if it contains spaces, punctuation and other characters besides letters.

Misspelling the file name, or not having the file in the same folder as the notebook containing the code, results in a file-not-found error, which Python reports as FileNotFoundError.
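
One possible way to make that error easier to diagnose is sketched below. The helper `load_table()` is hypothetical (it is not part of pandas or of this course); it checks that the file exists before handing it to `read_excel()`, and reports the folder that was searched. The file name used in the call is deliberately wrong.

```python
from pathlib import Path

import pandas as pd


def load_table(file_name):
    """Hypothetical helper: read an Excel file, with a clearer error if it is missing."""
    path = Path(file_name)
    if not path.exists():
        # Mention the current folder, to help spot typos and misplaced files
        raise FileNotFoundError(
            f"Cannot find '{file_name}' (current folder: {Path.cwd()})"
        )
    return pd.read_excel(path)


try:
    missing = load_table('no such file.xls')  # deliberately wrong name
except FileNotFoundError as error:
    print(error)
```

If the message appears, check the spelling of the name and whether the file really sits next to the notebook.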

**Selecting a column**

Now that you have the data, let the analysis begin!

Let’s tackle the first part of the first question: ‘What are the total, smallest, largest and
average number of deaths due to TB?’ Obtaining the total number will be done in two
steps: first select the column with the TB deaths, then sum the values in that column.
Selecting a single column of a dataframe is done with an expression in the format:

`dataFrame['column name']`


In [None]:
data['TB deaths']
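
To see what such an expression returns without needing the data file, here is a small sketch on a hand-made table. The dataframe `sample` and all its numbers are invented for illustration; only the column names follow the course data.

```python
import pandas as pd

# Hypothetical three-row table; the numbers are invented for illustration
sample = pd.DataFrame({
    'Country': ['Angola', 'Brazil', 'China'],
    'Population (1000s)': [21000, 200000, 1390000],
    'TB deaths': [7000, 4400, 41000],
})

deaths = sample['TB deaths']   # a single column is a Series
print(type(deaths))
print(deaths)

# The name must match the column header exactly, including spaces and capitals
try:
    sample['tb deaths']
except KeyError as error:
    print('No such column:', error)
```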

**Task-1**

Select the population column and store it in a variable, so
that you can use it in later exercises.

In [None]:
# Write your code here

**Calculations on a column**

In [None]:
tbColumn = data['TB deaths']
tbColumn.sum()

In [None]:
tbColumn.min()

In [None]:
tbColumn.max()

Like sum(), the column methods min() and max() don’t need arguments, whereas the Python functions min() and max() do need them, because there is no context (column) providing the values.
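
The difference between the built-in functions and the column methods can be sketched side by side. The values below are invented, and the names `deaths_list` and `deaths_column` are used only for this illustration.

```python
import pandas as pd

deaths_list = [7000, 4400, 41000]          # invented values
deaths_column = pd.Series(deaths_list)     # the same values as a column

# Built-in functions need the values passed in as an argument
print(min(deaths_list), max(deaths_list), sum(deaths_list))

# Column methods already know their values, so the parentheses stay empty
print(deaths_column.min(), deaths_column.max(), deaths_column.sum())
```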

The average is computed as before, by dividing the total by the number of countries, which is 12 in this table.

In [None]:
tbColumn.sum() / 12

This kind of average is called the mean and there’s a method for that.

In [None]:
tbColumn.mean()

Another kind of average is the median, the middle value when the numbers are sorted, and there’s a method for that too.

In [None]:
tbColumn.median()
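
As a sketch of how these averages relate, the cell below computes the mean by hand using `len()` rather than a hard-coded count, so the calculation keeps working if rows are added. The four values are invented; one of them is much larger than the rest, which pulls the mean up while the median stays close to the typical values.

```python
import pandas as pd

deaths_column = pd.Series([7000, 4400, 41000, 1200])  # invented values

# len() gives the number of rows, so no count needs to be typed in
manual_mean = deaths_column.sum() / len(deaths_column)

print(manual_mean)
print(deaths_column.mean())    # same result as the manual calculation
print(deaths_column.median())  # middle of the sorted values
```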

**Task-2**

Practise the use of column methods by applying them to the population column you
obtained in Task-1.

In [None]:
# Write your code here

**Sorting on a column**

One of the research questions was: which countries have the smallest and largest number of deaths?

Because the table is small, it is not too difficult to scan the TB deaths column and find those countries. However, such a process is error-prone and impractical for large tables. It’s much better to sort the table by that column, and then look up the countries in the first and last rows.

As you’ve guessed by now, sorting a table is another single line of code.

In [None]:
data.sort_values('TB deaths')

The dataframe method sort_values() takes a column name as an argument and returns a new dataframe in which the rows are in ascending order of the values in that column.

Note that sorting doesn’t modify the original dataframe.

In [None]:
data # rows still in original order
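
The same behaviour can be checked on a small hand-made table. This is a sketch with invented numbers; `iloc` (select a row by position) and the `ascending` argument are pandas features not otherwise covered in this section.

```python
import pandas as pd

# Invented values, deliberately not in sorted order
sample = pd.DataFrame({
    'Country': ['China', 'Angola', 'Brazil'],
    'TB deaths': [41000, 7000, 4400],
})

by_deaths = sample.sort_values('TB deaths')   # a new, sorted dataframe

print(by_deaths.iloc[0]['Country'])    # first row: fewest deaths
print(by_deaths.iloc[-1]['Country'])   # last row: most deaths

# Largest values first
print(sample.sort_values('TB deaths', ascending=False))

# The original table still has its rows in the original order
print(sample['Country'].tolist())
```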

It’s also possible to sort on a column that has text instead of numbers; the rows will be
sorted in alphabetical order.

In [None]:
data.sort_values('Country')
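
One caveat about alphabetical order is sketched below: text is compared character by character, and capital letters come before lower-case ones. The names are invented, with one deliberately written in lower case; the `key` argument (available in pandas 1.1 and later) is one possible way to ignore case.

```python
import pandas as pd

names = pd.DataFrame({'Country': ['brazil', 'Zambia', 'Angola']})

# Default order: capitals sort before lower-case letters
default_order = names.sort_values('Country')['Country'].tolist()
print(default_order)

# Compare lower-cased copies of the names instead
ignoring_case = names.sort_values(
    'Country', key=lambda column: column.str.lower()
)['Country'].tolist()
print(ignoring_case)
```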

**Task-3**

Sort the same table by population, to quickly see which are the least and the most populous countries.

In [None]:
# Write your code here

**Calculations over columns**

The last remaining task is to calculate the death rate of each country. You may recall that with the simple approach I’d have to write:

`rateAngola = deathsInAngola * 100 / populationOfAngola`

`rateBrazil = deathsInBrazil * 100 / populationOfBrazil`

and so on, and so on. If you’ve used spreadsheets, it’s the same process: create the formula for the first row and then copy it down for all the rows. This is laborious and error-prone, e.g. if rows are added later on. Given that data is organised by columns, wouldn’t it be nice to simply write the following?

`rateColumn = deathsColumn * 100 / populationColumn`

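With pandas we can do exactly this. Because the population column is given in thousands, multiplying by 100 expresses the rate as deaths per 100,000 people: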


In [None]:
deathsColumn = data['TB deaths']
populationColumn = data['Population (1000s)']
rateColumn = deathsColumn * 100 / populationColumn
rateColumn
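
To check that dividing two columns really works row by row, the sketch below compares the column calculation with the same formula written out for each row. The two rows of numbers are invented for illustration.

```python
import pandas as pd

deaths = pd.Series([7000, 4400])          # invented values
population = pd.Series([21000, 200000])   # invented values, in thousands

# One expression for the whole column
rate = deaths * 100 / population
print(rate)

# The same formula written out row by row
manual = [7000 * 100 / 21000, 4400 * 100 / 200000]
print(manual)
```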

With pandas, the arithmetic operators become much smarter. When adding, subtracting, multiplying or dividing columns, the computer understands that the operation is to be done row by row and creates a new column.

All well and good, but how do I put that new column into the dataframe, in order to have everything in a single table? In an assignment `variable = expression`, if the variable hasn’t been mentioned before, the computer creates the variable and stores the expression’s value in it. Likewise, if I assign to a column that doesn’t exist in the dataframe, the computer will create it.

In [None]:
data['TB deaths (per 100,000)'] = rateColumn
data
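
With the rate stored in the table, the remaining question about the smallest and largest death rates comes down to one more sort. The sketch below shows the idea on a hand-made table; the dataframe `sample` and all its numbers are invented, and only the column names follow the course data.

```python
import pandas as pd

# Hypothetical table; the numbers are invented for illustration
sample = pd.DataFrame({
    'Country': ['Angola', 'Brazil', 'China'],
    'Population (1000s)': [21000, 200000, 1390000],
    'TB deaths': [7000, 4400, 41000],
})

print('TB deaths (per 100,000)' in sample.columns)   # the column does not exist yet

# Assigning to a new column name creates the column
sample['TB deaths (per 100,000)'] = sample['TB deaths'] * 100 / sample['Population (1000s)']
print(sample.columns.tolist())

# Sort by the new column: lowest rate in the first row, highest in the last
by_rate = sample.sort_values('TB deaths (per 100,000)')
print(by_rate)
```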