**Pandas is a Python library.**

**Pandas is used to analyze data.**

**What is Pandas?**

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

**Why Use Pandas?**

Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets and make them readable and relevant.

Relevant data is very important in data science.

**Data Science:** is a branch of computer science where we study how to store, use, and analyze data to derive information from it.

**What Can Pandas Do?**

Pandas gives you answers about the data, such as:

- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
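The questions above map onto a handful of DataFrame and Series methods. The following is a minimal sketch on a small made-up data set; the column names and values are illustrative assumptions, not taken from a real file.

```python
import pandas as pd

# Hypothetical workout data; None marks a missing value.
workouts = pd.DataFrame({
    "duration": [30, 45, 60, 60, 45],
    "calories": [200.0, 300.0, 400.0, None, 310.0],
})

# Correlation between two columns (pairs with a missing value are skipped).
correlation = workouts["duration"].corr(workouts["calories"])

# Basic statistics for one column; missing values are ignored by default.
average = workouts["calories"].mean()
highest = workouts["calories"].max()
lowest = workouts["calories"].min()

# Cleaning: drop every row that contains an empty (NaN) value.
cleaned = workouts.dropna()

print(correlation, average, highest, lowest)
print(cleaned)
```

Note that dropna() returns a new DataFrame and leaves the original unchanged.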

**Where is the Pandas Codebase?**
The source code for Pandas is located in this GitHub repository: https://github.com/pandas-dev/pandas

In [2]:
import pandas as pd

my_dataset = {
    'cars':['BMW','Audi','Benz','Ford'],
    'passings':[3,4,6,4]
}

myvar = pd.DataFrame(my_dataset)
print(myvar)

   cars  passings
0   BMW         3
1  Audi         4
2  Benz         6
3  Ford         4


**Checking Pandas Version**

In [3]:
print(pd.__version__)

1.1.5


**What is a Series?**

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [4]:
import pandas as pd
x = [1,2,3,4]
series = pd.Series(x)
print(series)

0    1
1    2
2    3
3    4
dtype: int64


If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

In [5]:
series[0]

1

**Create Labels**
With the index argument, you can name your own labels.

In [7]:
import pandas as pd
x = [1,2,3,4]
series = pd.Series(x,index = ['a','b','c','d'])
series

a    1
b    2
c    3
d    4
dtype: int64

When you have created labels, you can access an item by referring to the label.

In [9]:
print(series['c'])

3
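Once custom labels are in place, it can help to state explicitly whether a lookup is by label or by position. A minimal sketch, reusing the same labels as above, with the loc (label-based) and iloc (position-based) accessors:

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# .loc looks an item up by its label, .iloc by its integer position.
by_label = series.loc["c"]
by_position = series.iloc[2]

# Both lookups refer to the same item in this Series.
print(by_label, by_position)
```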


**Key/Value Objects as Series**

You can also use a key/value object, like a dictionary, when creating a Series.

In [10]:
import pandas as pd
calories = {"day1":240,"day2":300,"day3":450}
series = pd.Series(calories)
print(series)

day1    240
day2    300
day3    450
dtype: int64


**Note: The keys of the dictionary become the labels.**

To select only some of the items in the dictionary, use the index argument and specify only the items you want to include in the Series.

In [11]:
import pandas as pd
calories = {"day1":240,"day2":300,"day3":450}
series = pd.Series(calories,index = ['day1','day2'])
print(series)

day1    240
day2    300
dtype: int64
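One detail worth knowing: a label in the index argument that is not a key in the dictionary does not raise an error; the corresponding value is filled with NaN. The sketch below uses "day4" as a hypothetical label that is absent from the dictionary.

```python
import pandas as pd

calories = {"day1": 240, "day2": 300, "day3": 450}

# "day4" is a hypothetical label that is not a key in the dictionary.
series = pd.Series(calories, index=["day1", "day4"])

# The missing key shows up as NaN rather than raising an error.
print(series)
```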


## **Pandas DataFrame**

**What is a DataFrame?**

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [12]:
import pandas as pd

data = {
    'calories':[350,400,200,160],
    'duration':[40,50,30,20]
}
df = pd.DataFrame(data)
print(df)

   calories  duration
0       350        40
1       400        50
2       200        30
3       160        20


**Locate Row**

As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified rows.

In [13]:
print(df.loc[0])

calories    350
duration     40
Name: 0, dtype: int64


In [15]:
#use a list of indexes:
print(df.loc[[0,2]])

   calories  duration
0       350        40
2       200        30


Note: When a list of labels is passed inside [ ], the result is a Pandas DataFrame. A single label returns a Pandas Series.
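The difference in return type can be checked directly. A small sketch, rebuilding the same DataFrame as above so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [350, 400, 200, 160],
    "duration": [40, 50, 30, 20],
})

one_row = df.loc[0]           # single label
one_row_frame = df.loc[[0]]   # list containing one label
two_rows = df.loc[[0, 2]]     # list of labels

print(type(one_row).__name__)
print(type(one_row_frame).__name__)
print(two_rows.shape)
```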

**Named Indexes**
With the index argument, you can name your own indexes.

In [18]:
import pandas as pd
data = {
    'calories':[150,200,250,300,400],
    'duration':[50,60,70,80,100]
}
df = pd.DataFrame(data,index=['day1','day2','day3','day4','day5'])
df

      calories  duration
day1       150        50
day2       200        60
day3       250        70
day4       300        80
day5       400       100


**Locate Named Indexes**

Use the named index in the loc attribute to return the specified row(s).

In [19]:
print(df.loc['day4'])

calories    300
duration     80
Name: day4, dtype: int64
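Named indexes also work with a list of labels or a label range. The sketch below is illustrative; note that, unlike ordinary Python slicing, a label slice with loc includes both end points.

```python
import pandas as pd

data = {
    "calories": [150, 200, 250, 300, 400],
    "duration": [50, 60, 70, 80, 100],
}
df = pd.DataFrame(data, index=["day1", "day2", "day3", "day4", "day5"])

# A list of labels returns exactly those rows.
picked = df.loc[["day1", "day5"]]

# A label slice includes both the start and the end label.
span = df.loc["day2":"day4"]

print(picked)
print(span)
```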


## **Read CSV Files**
A simple way to store large data sets is to use CSV (comma-separated values) files.

CSV files contain plain text and use a well-known format that can be read by almost any tool, including Pandas.

In our examples we will be using a CSV file called 'data.csv'.


In [1]:
import pandas as pd
df = pd.read_csv('data.csv')
df

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
...       ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
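If data.csv is not at hand, read_csv() can be tried on in-memory text instead, since it accepts file-like objects as well as paths. A minimal sketch, assuming the file has the four columns shown above; the rows are copied from the first lines of the output.

```python
import io
import pandas as pd

# A few rows in the same layout as data.csv, held in memory for illustration.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
60,103,135,340.0
45,109,175,282.4
"""

# StringIO wraps the text so that it behaves like an open file.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)
```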


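**Tip: use to_string() to print the entire DataFrame.**

The example below reads its data from a JSON file called 'data.json' with read_json() rather than from 'data.csv'.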

In [3]:

import pandas as pd

df = pd.read_json('data.json')

print(df.to_string()) 

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.5
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

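**JSON = Python Dictionary**

JSON objects have nearly the same format as Python dictionaries. The main difference is that JSON writes true, false, and null where Python writes True, False, and None.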
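Because of this similarity, JSON text can be parsed into a dictionary and passed straight to the DataFrame constructor. The sketch below assumes a column-oriented layout (column name, then row label, then value); the actual layout of data.json is not shown in this text, so treat the structure as an assumption.

```python
import json
import pandas as pd

# Hypothetical JSON text in a column-oriented layout.
json_text = (
    '{"Duration": {"0": 60, "1": 60, "2": 60},'
    ' "Pulse": {"0": 110, "1": 117, "2": 103}}'
)

# json.loads turns the JSON object into an ordinary Python dictionary.
as_dict = json.loads(json_text)

# The outer keys become columns, the inner keys become row labels (strings here).
df = pd.DataFrame(as_dict)
print(df)
```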

## Pandas - Analyzing DataFrames

### Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the head() method.

The head() method returns the headers and a specified number of rows, starting from the top.

In [7]:
import pandas as pd
df = pd.read_csv('data.csv')
df.head(10)

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.0
6        60    110       136     374.0
7        45    104       134     253.3
8        30    109       133     195.1
9        60     98       124     269.0


**Note: If the number of rows is not specified, the head() method returns the top 5 rows.**

In [9]:
import pandas as pd
df = pd.read_csv('data.csv')
df.head()

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0


There is also a tail() method for viewing the last rows of the DataFrame.

The tail() method returns the headers and a specified number of rows, starting from the bottom.

In [10]:
df.tail()

     Duration  Pulse  Maxpulse  Calories
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4
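The behavior of head() and tail() can be tried without any file. A small sketch on a generated DataFrame; the "value" column is made up for illustration.

```python
import pandas as pd

# 20 rows of made-up data with the default index 0..19.
df = pd.DataFrame({"value": range(20)})

first_three = df.head(3)   # first 3 rows
default_head = df.head()   # no argument: first 5 rows
last_two = df.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```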


### Info About the Data
The DataFrame object has a method called info() that gives you more information about the data set.

In [11]:
# info() prints its report itself, so no print() call is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
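The info() method writes its report directly to the output and returns None. When the text is needed as a string, or when only the count of empty values matters, something like the following sketch can be used. The small DataFrame is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical data with one missing Calories value.
df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Calories": [409.1, None, 340.0],
})

# Pass buf= to capture the report as text instead of printing it.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()

# Count the empty (NaN) values in each column.
missing = df.isna().sum()

print(report)
print(missing)
```

A non-null count lower than the number of entries, as in the Calories column, signals empty values that may need cleaning.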
