# Introduction to pandas
Pandas is a Python library for working with tabular data, time-series data, matrix data and similar structured data. Data scientists and analysts commonly use it to work with their data.

A few examples of things you can do with pandas:
 - Importing data from Comma-Separated Values (CSV) files, JSON files, etc.
 - Filling in missing values in a dataset
 - Data wrangling and manipulation
 - Merging multiple datasets into one dataset
 - Exporting results as Comma-Separated Values (CSV) files
 
This tutorial assumes you have already installed pandas by following the installation guide.
The official documentation is available at https://pandas.pydata.org

## 1. Importing data from CSV files in Pandas
Let’s start with an essential step: importing pandas into our Jupyter Notebook.
We can import pandas by typing: import pandas as pd

Note: The alias pd is optional, but it is the common convention and saves typing in later steps.

In [1]:
import pandas as pd

Now let us use a CSV file created for this exercise, Pandas_intro_university_rankings.csv, which lists the top 10 universities and their scores. We will use the *read_csv* function to load the CSV file into a *DataFrame*, the data structure pandas uses to hold this data. A DataFrame stores “relational” or “tabular” data and provides many built-in functions for manipulating it.

In [2]:
df = pd.read_csv('../dataset/Pandas_intro_university_rankings.csv')

We have used the read_csv function to load the file into a DataFrame and assigned it to the variable df. In a notebook, evaluating the name df on its own displays the DataFrame.

In [3]:
df

Unnamed: 0,rank,name,scores_overall
0,1,University of Oxford,96.4
1,2,Harvard University,95.2
2,3,University of Cambridge,94.8
3,3,Stanford University,94.8
4,5,Massachusetts Institute of Technology,94.2
5,6,California Institute of Technology,94.1
6,7,Princeton University,92.4
7,8,"University of California, Berkeley",92.1
8,9,Yale University,91.4
9,10,Imperial College London,90.4
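
If you do not have the exercise file at hand, the same loading step can be tried on an in-memory CSV. The sketch below is only an illustration: it assumes the file has a header row with the three column names, copies the first three rows shown above, and uses the made-up names csv_text and sample_df.

```python
import io

import pandas as pd

# Stand-in for the CSV file: a header row plus the first three rows shown above.
csv_text = """rank,name,scores_overall
1,University of Oxford,96.4
2,Harvard University,95.2
3,University of Cambridge,94.8
"""

# read_csv accepts a file path or a file-like object, so StringIO works here.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df)
print(type(sample_df))
```

Replacing io.StringIO(csv_text) with a file path gives the same kind of DataFrame, read from disk instead of from a string.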


## 2. Exploring the data

For large datasets, we might not want to view all of the data. In that case, use the head method to display the first N rows, where N is the argument passed to head.

In [4]:
df.head(5)

Unnamed: 0,rank,name,scores_overall
0,1,University of Oxford,96.4
1,2,Harvard University,95.2
2,3,University of Cambridge,94.8
3,3,Stanford University,94.8
4,5,Massachusetts Institute of Technology,94.2
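
As a small sketch of how head behaves, assuming a DataFrame rebuilt by hand from the first six rows of the table (sample_df is an illustrative name): calling head with no argument returns five rows, and the companion method tail returns rows from the end.

```python
import pandas as pd

# Hand-built copy of the first six rows shown earlier in this notebook.
sample_df = pd.DataFrame({
    "rank": [1, 2, 3, 3, 5, 6],
    "name": [
        "University of Oxford",
        "Harvard University",
        "University of Cambridge",
        "Stanford University",
        "Massachusetts Institute of Technology",
        "California Institute of Technology",
    ],
    "scores_overall": [96.4, 95.2, 94.8, 94.8, 94.2, 94.1],
})

first_five = sample_df.head()   # no argument: head defaults to 5 rows
first_two = sample_df.head(2)   # the first 2 rows
last_two = sample_df.tail(2)    # the last 2 rows

print(first_two)
print(last_two)
```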


You can use the shape attribute to see the number of rows and columns in the DataFrame. It returns a tuple: the first element is the number of rows and the second is the number of columns.

In [5]:
df.shape

(10, 3)
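
Because shape is a plain tuple, it can be unpacked into two variables. A minimal sketch, assuming a three-row DataFrame built by hand from the data above (sample_df, n_rows and n_cols are illustrative names):

```python
import pandas as pd

# Hand-built copy of the first three rows of the rankings table.
sample_df = pd.DataFrame({
    "rank": [1, 2, 3],
    "name": [
        "University of Oxford",
        "Harvard University",
        "University of Cambridge",
    ],
    "scores_overall": [96.4, 95.2, 94.8],
})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows, {n_cols} columns")  # prints "3 rows, 3 columns"

# len() of a DataFrame is its number of rows.
row_count = len(sample_df)
```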

You can also list out the column names in the dataframe using the columns attribute.

In [6]:
df.columns

Index(['rank', 'name', 'scores_overall'], dtype='object')
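
The column names can also be used to pick out part of the DataFrame. The sketch below assumes a hand-built three-row copy of the data (sample_df and the other variable names are illustrative): it converts the column index to a regular Python list and selects two columns by name.

```python
import pandas as pd

# Hand-built copy of the first three rows of the rankings table.
sample_df = pd.DataFrame({
    "rank": [1, 2, 3],
    "name": [
        "University of Oxford",
        "Harvard University",
        "University of Cambridge",
    ],
    "scores_overall": [96.4, 95.2, 94.8],
})

# Convert the column index to a regular Python list.
column_names = sample_df.columns.tolist()
print(column_names)  # ['rank', 'name', 'scores_overall']

# Passing a list of column names returns a DataFrame with only those columns.
names_and_scores = sample_df[["name", "scores_overall"]]
print(names_and_scores)
```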

What if we don’t want to view the data itself, but want a quick overview of the entire DataFrame instead, such as the data types of the columns and the number of non-null values in each column? We can use the df.info() method.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   rank            10 non-null     int64  
 1   name            10 non-null     object 
 2   scores_overall  10 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 368.0+ bytes
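
info() prints its summary rather than returning it. If you need the same facts as values, one possible approach is sketched below. It assumes a hand-built DataFrame in which one score has been deliberately set to None; that missing value is invented for illustration and does not occur in the exercise file.

```python
import pandas as pd

# Hand-built sample; the None in scores_overall is invented to show a null.
sample_df = pd.DataFrame({
    "rank": [1, 2, 3],
    "name": [
        "University of Oxford",
        "Harvard University",
        "University of Cambridge",
    ],
    "scores_overall": [96.4, None, 94.8],
})

# Data type of each column, as a Series indexed by column name.
column_dtypes = sample_df.dtypes
print(column_dtypes)

# Number of missing values in each column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```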


## 3. Filtering the data
Filtering means selecting the subset of a DataFrame's rows that satisfy a particular condition.
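
As a minimal sketch of condition-based filtering, assuming a small DataFrame built by hand from the first four rows shown earlier (sample_df, is_high_score and high_scorers are illustrative names, and the threshold of 95 is arbitrary):

```python
import pandas as pd

# Hand-built copy of the first four rows of the rankings table.
sample_df = pd.DataFrame({
    "rank": [1, 2, 3, 3],
    "name": [
        "University of Oxford",
        "Harvard University",
        "University of Cambridge",
        "Stanford University",
    ],
    "scores_overall": [96.4, 95.2, 94.8, 94.8],
})

# Comparing a column to a value gives one True/False entry per row.
is_high_score = sample_df["scores_overall"] > 95

# Indexing with that boolean Series keeps only the rows marked True.
high_scorers = sample_df[is_high_score]
print(high_scorers)
```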

Let us say we want the top 3 universities, sorted by the rank column. We will use the DataFrame.sort_values function, described here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html


In our case, we need to pass the following arguments:
 - by = 'rank'
 - axis = 0 (sort the rows, since we want the top-N rows; this is also the default)

We will also use Python "slicing" to get the first N values. For example, to get the first 10 elements of a list, use: arr[:10]
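
A quick sketch of slicing on a plain Python list, using the rank values from the table as sample data (the variable names are illustrative):

```python
# The rank values from the table, as a plain Python list.
ranks = [1, 2, 3, 3, 5, 6, 7, 8, 9, 10]

first_three = ranks[:3]   # elements at positions 0, 1 and 2
last_two = ranks[-2:]     # negative indices count from the end

print(first_three)  # [1, 2, 3]
print(last_two)     # [9, 10]

# Asking for more elements than exist is not an error: you get what is there.
short_list = [1, 2]
print(short_list[:10])  # [1, 2]
```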


In [8]:
top_3_universities = df.sort_values(by='rank', axis=0)[:3]

In [9]:
top_3_universities

Unnamed: 0,rank,name,scores_overall
0,1,University of Oxford,96.4
1,2,Harvard University,95.2
2,3,University of Cambridge,94.8
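
Note that ranks can be tied: in the data above, University of Cambridge and Stanford University both have rank 3, so taking the first three rows keeps only one of them. The sketch below, which assumes a hand-built copy of the first five rows (sample_df and the result names are illustrative), shows one possible way to keep tied rows using DataFrame.nsmallest with keep='all'. It is an alternative for illustration, not the method used in this notebook.

```python
import pandas as pd

# Hand-built copy of the first five rows; note the tie at rank 3.
sample_df = pd.DataFrame({
    "rank": [1, 2, 3, 3, 5],
    "name": [
        "University of Oxford",
        "Harvard University",
        "University of Cambridge",
        "Stanford University",
        "Massachusetts Institute of Technology",
    ],
    "scores_overall": [96.4, 95.2, 94.8, 94.8, 94.2],
})

# Sort, then slice: always exactly 3 rows, so one of the tied rows is dropped.
# kind="stable" keeps tied rows in their original order.
top_3_sliced = sample_df.sort_values(by="rank", kind="stable")[:3]

# nsmallest with keep="all" returns every row tied with the 3rd smallest rank.
top_3_with_ties = sample_df.nsmallest(3, "rank", keep="all")

print(top_3_sliced)
print(top_3_with_ties)
```

Which behaviour is preferable depends on whether "top 3" should mean exactly three rows or every university ranked third or better.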


We will use the functions described in this notebook in our project. Please feel free to explore and write your own code to get familiar with pandas before starting the project.