# Introduction to pandas
Pandas is a Python library for working with tabular data, time-series data, matrix data and so on. Data scientists and analysts commonly use it to load, clean and analyze their data.

A few examples of things you can do with Pandas:
 - Importing data from Comma-Separated Values (CSV) files, JSON files, etc.
 - Filling in missing values in the dataset
 - Data wrangling and manipulation
 - Merging multiple datasets into one dataset
 - Exporting results as Comma-Separated Values (CSV) files
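
As a rough illustration of a few of the tasks listed above, here is a minimal sketch. The tiny datasets, their column names, and the choice of filling gaps with the mean are all made up for this example; they are not part of the rankings data used later.

```python
import pandas as pd

# Hypothetical data with one missing score (None becomes NaN).
scores = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma"],
    "score": [90.0, None, 70.0],
})

# Fill the missing value with the mean of the known scores (one possible strategy).
scores["score"] = scores["score"].fillna(scores["score"].mean())

# Merge with a second hypothetical dataset on the shared "name" column.
countries = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma"],
    "country": ["X", "Y", "Z"],
})
merged = scores.merge(countries, on="name")

# Export as CSV text; pass a file path to to_csv to write a file instead.
csv_out = merged.to_csv(index=False)
print(csv_out)
```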
 
This tutorial assumes you have installed Pandas after reading the installation guide. 
Official documentation can be found here: https://pandas.pydata.org

## 1. Importing data from CSV files in Pandas
Let’s start with a very important step: importing Pandas in our Jupyter Notebook.
We can import Pandas by typing: import pandas as pd


Note: Using an alias such as pd for Pandas saves typing in later steps. The alias is optional, but pd is the widely used convention.

In [1]:
import pandas as pd

Now let us use a CSV file downloaded from Kaggle: 2022_rankings.csv. This file lists the top universities in 2022. We will use the *read_csv* function to load the CSV file into a *DataFrame*, the data structure Pandas uses to hold this kind of data. A DataFrame lets us store “relational” or “tabular” data and provides many built-in functions for manipulating it.

In [2]:
df = pd.read_csv('../dataset/2022_rankings.csv')

We used the read_csv function to load the file into a DataFrame and assigned it to a variable, df. In a notebook, you can enter the variable name, df, on its own to display the DataFrame.
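
If you do not have the Kaggle file at hand, you can still try read_csv on in-memory text. The sketch below assumes a small, made-up CSV string (the university names are placeholders) and wraps it in io.StringIO so that read_csv can treat it like a file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """rank_order,rank,name
10,1,University A
20,=2,University B
30,=2,University C
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))
print(sample_df)
```

Note that the rank column is read as text because values such as =2 are not valid numbers.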

In [3]:
df

Unnamed: 0,rank_order,rank,name,scores_overall,scores_overall_rank,scores_teaching,scores_teaching_rank,scores_research,scores_research_rank,scores_citations,...,scores_international_outlook_rank,location,stats_number_students,stats_student_staff_ratio,stats_pc_intl_students,stats_female_male_ratio,aliases,subjects_offered,closed,unaccredited
0,10,1,University of Oxford,95.7,10,91.0,5,99.6,1,98.0,...,26,United Kingdom,20835,10.7,42%,47 : 53,University of Oxford,"Accounting & Finance,General Engineering,Commu...",False,False
1,20,=2,California Institute of Technology,95.0,20,93.6,2,96.9,4,97.8,...,167,United States,2233,6.3,34%,36 : 64,California Institute of Technology caltech,"Languages, Literature & Linguistics,Economics ...",False,False
2,30,=2,Harvard University,95.0,30,94.5,1,98.9,3,99.2,...,209,United States,21574,9.5,24%,50 : 50,Harvard University,"Mathematics & Statistics,Civil Engineering,Lan...",False,False
3,40,4,Stanford University,94.9,40,92.3,3,96.8,5,99.9,...,211,United States,16319,7.3,23%,46 : 54,Stanford University,"Physics & Astronomy,Computer Science,Politics ...",False,False
4,50,=5,University of Cambridge,94.6,50,90.9,6,99.5,2,96.2,...,32,United Kingdom,19680,11.1,39%,47 : 53,University of Cambridge,"Business & Management,General Engineering,Art,...",False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2107,1000446,Reporter,Yaşar University,,1000446,,0,,0,,...,0,Turkey,6847,13.0,2%,53 : 47,Yaşar University,"Art, Performing Arts & Design,Mechanical & Aer...",False,False
2108,1000447,Reporter,Yenepoya University,,1000447,,0,,0,,...,0,India,3104,6.1,0%,67 : 33,Yenepoya University,"Medicine & Dentistry,Biological Sciences,Other...",False,False
2109,1000448,Reporter,Yogyakarta State University,,1000448,,0,,0,,...,0,Indonesia,24988,20.3,1%,72 : 28,Yogyakarta State University,"Civil Engineering,Physics & Astronomy,Educatio...",False,False
2110,1000449,Reporter,York St John University,,1000449,,0,,0,,...,0,United Kingdom,6030,18.0,8%,66 : 34,York St John University,"Biological Sciences,General Engineering,Geogra...",False,False


## 2. Exploring the data

For large datasets, we might not want to view all the data. In that case, use the head method to display the first N rows, where N is the argument passed to head.

In [4]:
df.head(5)

Unnamed: 0,rank_order,rank,name,scores_overall,scores_overall_rank,scores_teaching,scores_teaching_rank,scores_research,scores_research_rank,scores_citations,...,scores_international_outlook_rank,location,stats_number_students,stats_student_staff_ratio,stats_pc_intl_students,stats_female_male_ratio,aliases,subjects_offered,closed,unaccredited
0,10,1,University of Oxford,95.7,10,91.0,5,99.6,1,98.0,...,26,United Kingdom,20835,10.7,42%,47 : 53,University of Oxford,"Accounting & Finance,General Engineering,Commu...",False,False
1,20,=2,California Institute of Technology,95.0,20,93.6,2,96.9,4,97.8,...,167,United States,2233,6.3,34%,36 : 64,California Institute of Technology caltech,"Languages, Literature & Linguistics,Economics ...",False,False
2,30,=2,Harvard University,95.0,30,94.5,1,98.9,3,99.2,...,209,United States,21574,9.5,24%,50 : 50,Harvard University,"Mathematics & Statistics,Civil Engineering,Lan...",False,False
3,40,4,Stanford University,94.9,40,92.3,3,96.8,5,99.9,...,211,United States,16319,7.3,23%,46 : 54,Stanford University,"Physics & Astronomy,Computer Science,Politics ...",False,False
4,50,=5,University of Cambridge,94.6,50,90.9,6,99.5,2,96.2,...,32,United Kingdom,19680,11.1,39%,47 : 53,University of Cambridge,"Business & Management,General Engineering,Art,...",False,False
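
To see how the argument to head behaves, here is a small sketch on a made-up ten-row DataFrame (the column name n is an assumption for illustration). It also shows tail, the counterpart that returns the last rows.

```python
import pandas as pd

# Hypothetical DataFrame with ten rows, numbered 0 to 9.
toy = pd.DataFrame({"n": range(10)})

first_three = toy.head(3)   # first 3 rows
default_head = toy.head()   # with no argument, head returns the first 5 rows
last_two = toy.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```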


You can use the shape attribute to see the number of rows and columns in the DataFrame. It is a tuple: the first element is the number of rows and the second is the number of columns.

In [5]:
df.shape

(2112, 24)
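
Because shape is a plain tuple, it can be unpacked into two variables. A minimal sketch on a made-up DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is an attribute, so there are no parentheses after it.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# len() on a DataFrame also gives the number of rows.
print(len(toy))
```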

You can also list out the column names in the dataframe using the columns attribute.

In [6]:
df.columns

Index(['rank_order', 'rank', 'name', 'scores_overall', 'scores_overall_rank',
       'scores_teaching', 'scores_teaching_rank', 'scores_research',
       'scores_research_rank', 'scores_citations', 'scores_citations_rank',
       'scores_industry_income', 'scores_industry_income_rank',
       'scores_international_outlook', 'scores_international_outlook_rank',
       'location', 'stats_number_students', 'stats_student_staff_ratio',
       'stats_pc_intl_students', 'stats_female_male_ratio', 'aliases',
       'subjects_offered', 'closed', 'unaccredited'],
      dtype='object')
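
Once you know the column names, you can use them to check whether a column exists or to select only the columns you care about. The sketch below uses a made-up DataFrame that borrows three column names from the output above; the values are placeholders.

```python
import pandas as pd

# Hypothetical DataFrame reusing a few column names from the rankings data.
toy = pd.DataFrame({
    "name": ["University A", "University B"],
    "location": ["United Kingdom", "United States"],
    "rank_order": [10, 20],
})

column_names = list(toy.columns)           # plain Python list of names
has_location = "location" in toy.columns   # membership test
subset = toy[["name", "location"]]         # keep only the listed columns

print(column_names)
print(subset)
```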

What if we don’t want to view the data itself, but want a quick overview of the entire DataFrame instead, such as the data types of the columns and the number of non-null values in each column? We can use the df.info() method.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2112 entries, 0 to 2111
Data columns (total 24 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   rank_order                         2112 non-null   int64  
 1   rank                               2112 non-null   object 
 2   name                               2112 non-null   object 
 3   scores_overall                     1662 non-null   object 
 4   scores_overall_rank                2112 non-null   int64  
 5   scores_teaching                    1662 non-null   float64
 6   scores_teaching_rank               2112 non-null   int64  
 7   scores_research                    1662 non-null   float64
 8   scores_research_rank               2112 non-null   int64  
 9   scores_citations                   1662 non-null   float64
 10  scores_citations_rank              2112 non-null   int64  
 11  scores_industry_income             1662 non-null   float
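
The individual pieces of this summary can also be computed separately. Here is a minimal sketch on a made-up DataFrame with one missing score, using count for non-null values, isna for missing values, and dtypes for the column types.

```python
import pandas as pd

# Hypothetical DataFrame; None in a numeric column becomes NaN.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "score": [95.7, None, 94.9],
})

non_null = toy.count()       # non-null values per column
missing = toy.isna().sum()   # missing values per column

print(non_null)
print(missing)
print(toy.dtypes)            # data type of each column
```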

If you look at the 'rank' column, you will notice that it is of type 'object'. This means it has been interpreted as a string. This is also evident from the values in the dataframe, where rank has values like =2 and =5, which denote universities that share the same rank. These can't be represented as integers. As a result, we will use the 'rank_order' column to rank the universities.
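
If you ever do need a numeric version of such a column, one possible approach (not used in this tutorial) is to strip the leading = and convert what remains. The sketch below runs on a few made-up values modelled on those shown above; anything that still can't be parsed, such as Reporter, becomes NaN.

```python
import pandas as pd

# Hypothetical sample of rank values, stored as strings.
ranks = pd.Series(["1", "=2", "=2", "4", "Reporter"])

# Remove a leading "=", then convert; unparseable values become NaN.
numeric_rank = pd.to_numeric(ranks.str.lstrip("="), errors="coerce")

print(numeric_rank)
```

Because NaN is a floating-point value, the resulting column is float64 rather than an integer type.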

## 3. Filtering the data
Filtering means selecting the subset of a DataFrame's rows that satisfy a particular condition.
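
A common way to filter in Pandas is with a boolean condition inside square brackets. The sketch below uses a made-up DataFrame that borrows the location and rank_order column names; the rows are placeholders.

```python
import pandas as pd

# Hypothetical data reusing two column names from the rankings file.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "location": ["United Kingdom", "United States",
                 "United States", "United Kingdom"],
    "rank_order": [10, 20, 30, 40],
})

# Keep only the rows where the condition is True.
uk = toy[toy["location"] == "United Kingdom"]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
top_us = toy[(toy["location"] == "United States") & (toy["rank_order"] <= 20)]

print(uk)
print(top_us)
```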

Let us say we want the top 3 universities, sorted by the rank_order column. We will use the DataFrame.sort_values function, described here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html


In our case, we need to pass the following arguments:
 - by = 'rank_order'
 - axis = 0 (sorts the rows, because we want the top-N rows; this is also the default)

We will also use Python slicing to get the first N values. For example, to get the first 10 elements of a list, use: arr[:10]
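
Here is a short sketch of slicing on a plain list and on a made-up DataFrame. The names arr and frame are only for illustration.

```python
import pandas as pd

# Slicing a list: arr[:10] keeps the elements at positions 0 through 9.
arr = list(range(1, 21))
first_ten = arr[:10]
print(first_ten)

# The same syntax on a DataFrame selects rows by position.
frame = pd.DataFrame({"n": [5, 6, 7, 8]})
first_two_rows = frame[:2]
print(first_two_rows)
```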


In [8]:
top_3_universities = df.sort_values(by='rank_order', axis=0)[:3]

In [9]:
top_3_universities

Unnamed: 0,rank_order,rank,name,scores_overall,scores_overall_rank,scores_teaching,scores_teaching_rank,scores_research,scores_research_rank,scores_citations,...,scores_international_outlook_rank,location,stats_number_students,stats_student_staff_ratio,stats_pc_intl_students,stats_female_male_ratio,aliases,subjects_offered,closed,unaccredited
0,10,1,University of Oxford,95.7,10,91.0,5,99.6,1,98.0,...,26,United Kingdom,20835,10.7,42%,47 : 53,University of Oxford,"Accounting & Finance,General Engineering,Commu...",False,False
1,20,=2,California Institute of Technology,95.0,20,93.6,2,96.9,4,97.8,...,167,United States,2233,6.3,34%,36 : 64,California Institute of Technology caltech,"Languages, Literature & Linguistics,Economics ...",False,False
2,30,=2,Harvard University,95.0,30,94.5,1,98.9,3,99.2,...,209,United States,21574,9.5,24%,50 : 50,Harvard University,"Mathematics & Statistics,Civil Engineering,Lan...",False,False
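
The rankings file already happens to be ordered by rank_order, so the effect of sorting is easier to see on shuffled data. The sketch below uses a made-up, unsorted DataFrame and compares the slicing approach used above with two alternatives, head and nsmallest, which should select the same rows.

```python
import pandas as pd

# Hypothetical, deliberately unsorted data.
toy = pd.DataFrame({
    "name": ["C", "A", "D", "B"],
    "rank_order": [30, 10, 40, 20],
})

# Sort ascending by rank_order, then take the first three rows.
by_slice = toy.sort_values(by="rank_order", axis=0)[:3]

# Alternative 1: head(3) instead of slicing.
by_head = toy.sort_values(by="rank_order").head(3)

# Alternative 2: nsmallest picks the 3 rows with the smallest rank_order.
by_nsmallest = toy.nsmallest(3, "rank_order")

print(by_slice)
```

Note that the original row labels (the index) are kept after sorting, which is why the index in the output is no longer 0, 1, 2.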


We will use the functions described in this notebook in our project. Please feel free to explore and write your own code to get familiar with pandas before starting the project.