## Introduction to Jupyter Notebook and Pandas

Now that you have successfully installed Jupyter Notebook, let's take it for a test run and, at the same time, learn a bit about `pandas`. `pandas` is a Python library designed for data analysis and data science. We will start learning about it today and keep using it until the end of the course.

### Basic Jupyter operations

- A Jupyter notebook consists of cells, each of which can contain some Python code (by default).
- When you run a cell (use the Shift-Enter keyboard shortcut), that code gets executed.
- All the entities defined in that piece of code (variables, functions, classes, etc.) become part of the current scope and are accessible from any other cell.
- If you modify this cell while keeping the names of the entities the same, and run it again, these entities get overwritten.
- This allows you to work in an iterative style. You can divide your whole analysis program into smaller pieces, and iterate on each piece individually.
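
The behaviour described above can be sketched in plain Python. The comments mark where each notebook cell would begin; the names `greeting` and `shout` are made up for this illustration and are not part of the course material.

```python
# --- Cell 1: define a variable and a function ---
greeting = "hello"

def shout(text):
    return text.upper() + "!"

# --- Cell 2: names defined in cell 1 are visible here ---
first_result = shout(greeting)

# --- Cell 1, edited and run again: same name, new definition ---
def shout(text):
    return text.upper() + "!!!"

# --- Cell 2, run again: it now picks up the new definition ---
second_result = shout(greeting)

print(first_result, second_result)
```

Note that re-running cell 1 alone does not update `first_result`; a cell's results only change when that cell itself is run again.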

This knowledge is enough to get us started.

### Reading a csv file

The first step to using any Python library is to import it.

In [2]:
import pandas as pd

- The `pandas` library exposes a number of functions, one of which is used to read a csv file.
- The `read_csv()` function takes the name or path of the csv file as its argument, and returns an object of type `DataFrame`. The path of the file is relative to the directory that contains this notebook.
- For now, download [`top_100_songs.csv`](https://raw.githubusercontent.com/amangup/data-analysis-bootcamp/master/05-JupyterNotebook/top_100_songs.csv) into the same directory in which you created your Jupyter notebook.
- `DataFrame` is the main type used by `pandas` to represent data in memory. We will talk more about it in the next lecture.

Note that the second line of the cell below just states the value that I want to output, without calling the `print()` function. When the last line of a cell is an expression, Jupyter Notebook automatically displays its value. On top of that, Jupyter Notebook formats many different data types in a more user-friendly manner.

In [5]:
df = pd.read_csv('top_100_songs.csv')
type(df)

pandas.core.frame.DataFrame
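
If you do not have the file at hand yet, here is a minimal sketch of the same call on in-memory text. It relies on `read_csv()` also accepting a file-like object such as `io.StringIO`. The three rows are a hand-typed sample with only a few of the columns, and the names `csv_text` and `sample_df` are chosen for this illustration.

```python
import io

import pandas as pd

# A tiny hand-typed sample, not the real top_100_songs.csv file.
csv_text = """year,position,artist,song
2000,1,Faith Hill,Breathe
2000,2,Joe Thomas,I Wanna Know
2001,1,Lifehouse,Hanging By A Moment
"""

# read_csv() parses the header row into column names and
# each following line into one row of the DataFrame.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))
print(sample_df.shape)  # (number of rows, number of columns)
```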

Now let's have a look at the data. The `head(n)` method of the `DataFrame` class returns the first `n` rows of the dataframe.

Note that the method itself returns a dataframe (you can verify this with `type()`), and Jupyter Notebook renders that dataframe nicely for us.

In [6]:
df.head(10)

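| Unnamed: 0 | year | position | artist | song | score | us | uk | de | fr | ca | au |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | 1 | Faith Hill | Breathe | 24030.051 | 2 | 33 | - | - | - | 23 |
| 1 | 2000 | 2 | Joe Thomas | I Wanna Know | 21516.777 | 4 | - | - | 61 | - | 34 |
| 2 | 2000 | 3 | Santana & The Product G | Maria Maria | 20941.78 | 1 | 6 | - | 1 | - | 49 |
| 3 | 2000 | 4 | Vertical Horizon | Everything You Want | 20402.965 | 1 | 42 | - | - | - | 24 |
| 4 | 2000 | 5 | Toni Braxton | He Wasn't Man Enough | 20068.614 | 2 | 5 | - | 14 | - | 5 |
| 5 | 2000 | 6 | Rob Thomas & Santana | Smooth | 19876.371 | 1 | 3 | - | 15 | - | 5 |
| 6 | 2000 | 7 | Aaliyah | Try Again | 19670.508 | 1 | 5 | - | 26 | - | 8 |
| 7 | 2000 | 8 | Matchbox Twenty | Bent | 18997.978 | 1 | - | - | - | 1 | 19 |
| 8 | 2000 | 9 | Lonestar | Amazed | 18977.858 | 1 | 21 | - | - | - | 19 |
| 9 | 2000 | 10 | Destiny's Child | Say My Name | 18817.66 | 1 | 3 | - | 10 | - | 1 |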
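
To see what `head()` does without the songs file, here is a small sketch on a made-up dataframe. The name `numbers` and its columns are hypothetical and used only for this example.

```python
import pandas as pd

# A made-up dataframe with seven rows.
numbers = pd.DataFrame({
    "n": range(1, 8),
    "square": [i * i for i in range(1, 8)],
})

top3 = numbers.head(3)         # first three rows
default_head = numbers.head()  # with no argument, n defaults to 5

print(type(top3))              # head() returns a DataFrame, not a list
print(len(top3), len(default_head))
```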


### Our first data transformation in pandas

Let's do a data transformation: filtering.

- A way to filter in pandas is to put the filter condition inside square brackets, the operator typically used for indexing in Python.
- The condition is evaluated by pandas for every row at once, so its syntax is not always the same as the boolean constructs we've used in Python. For example, conditions are combined with `&` and `|` rather than `and` and `or`.

In [7]:
df_filtered = df[df.year != 2000]
df_filtered.head(10)

| Unnamed: 0 | year | position | artist | song | score | us | uk | de | fr | ca | au |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 2001 | 1 | Lifehouse | Hanging By A Moment | 23865.285 | 2 | 25 | - | - | - | 1 |
| 101 | 2001 | 2 | Train | Drops Of Jupiter (Tell Me) | 19773.231 | 5 | 10 | - | 73 | - | 5 |
| 102 | 2001 | 3 | Alicia Keys | Fallin' | 18779.805 | 1 | 3 | - | 10 | - | 7 |
| 103 | 2001 | 4 | Eve & Gwen Stefani | Let Me Blow Ya Mind | 18647.822 | 2 | 4 | - | 15 | - | 4 |
| 104 | 2001 | 5 | Dido | Thank You | 18085.379 | 3 | 3 | - | 30 | - | - |
| 105 | 2001 | 6 | Shaggy | Angel | 17624.324 | 1 | 1 | - | 8 | - | 1 |
| 106 | 2001 | 7 | Jennifer Lopez & Ja Rule | I'm Real | 17500.313 | 1 | 4 | - | 65 | - | 6 |
| 107 | 2001 | 8 | Staind | It's Been Awhile | 16888.466 | 5 | 15 | - | - | - | 24 |
| 108 | 2001 | 9 | Janet Jackson | All For You | 16727.022 | 1 | 3 | - | 3 | - | 5 |
| 109 | 2001 | 10 | Blu Cantrell | Hit 'em Up Style (Oops!) | 16687.608 | 2 | 12 | - | 47 | - | 3 |
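
It can help to look at the filter in two steps. The sketch below uses a four-row sample typed in by hand from the outputs above; the names `songs`, `mask`, `not_2000` and `top_of_2001` are chosen for this illustration. The comparison produces a column of `True`/`False` values, and indexing with that column keeps only the rows marked `True`.

```python
import pandas as pd

# Four rows typed in by hand, with only three of the columns.
songs = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "position": [1, 2, 1, 2],
    "artist": ["Faith Hill", "Joe Thomas", "Lifehouse", "Train"],
})

# Step 1: the comparison is applied to every row and gives a boolean Series.
mask = songs.year != 2000

# Step 2: indexing with the boolean Series keeps the rows where it is True.
not_2000 = songs[mask]

# Two conditions are combined with & (and) or | (or).
# Each condition needs its own parentheses.
top_of_2001 = songs[(songs.year == 2001) & (songs.position == 1)]

print(mask.tolist())
print(not_2000.artist.tolist())
print(top_of_2001.artist.tolist())
```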


A dataframe supports some of Python's built-in functions. For example, `len()` returns the number of rows.

In [8]:
len(df)

2000

In [10]:
len(df_filtered)

1900
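
As a small cross-check, the row count from `len()` matches the first entry of the dataframe's `shape` attribute, which also reports the number of columns. The sketch below uses a made-up three-row dataframe named `small`, not the songs data.

```python
import pandas as pd

# A made-up dataframe with three rows and two columns.
small = pd.DataFrame({
    "year": [2000, 2000, 2001],
    "position": [1, 2, 1],
})

row_count = len(small)    # number of rows
rows, cols = small.shape  # (rows, columns)

# Filtering removes rows, so the length of the result is smaller.
filtered_count = len(small[small.year != 2000])

print(row_count, rows, cols, filtered_count)
```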

Finally, let's learn some keyboard shortcuts that are useful.

- If the frame around the current cell is green, you are in _edit_ mode. If it is blue, you are in _command_ mode.
- To switch to _edit_ mode, press **Enter**. To switch to _command_ mode, press **Esc**.
- While in _edit_ mode, you can press **Shift-Enter** to execute the cell.
- While in _command_ mode, you can
  - Press **a** to create a new cell above the current cell.
  - Press **b** to create a new cell below the current cell.
  - Press **x** to cut (remove) the current cell.
  - Press **c** to copy the current cell, and **v** to paste the copied cell below the current cell.
  - Press **s** to save the notebook.
  - Use **Up** and **Down** arrow keys to navigate up and down the cells.
