### What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

Typical uses include:

- Data analysis
- Data manipulation
- Data cleaning
- Handling structured data (tables)

It works mainly with tabular data like Excel sheets, CSV files, databases, etc.

### Why Use Pandas?
Pandas lets us analyze large data sets and draw conclusions from them using statistical methods.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

### Data Structures in Pandas:
Pandas provides two data structures for manipulating data which are as follows:

1. Pandas Series
2. Pandas DataFrame

1. Series:

A Pandas Series is like a column in a table.

It is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).  
The axis labels are collectively called the index.

In [1]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


In [None]:
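import pandas as pd

# An empty Series; the dtype is given explicitly instead of relying on the default
s = pd.Series(dtype="object")
print("Pandas Series: ", s)

# A plain Python list of strings (pd.array() would produce the "string" dtype instead)
data = ['g', 'e', 'e', 'k', 's']

s = pd.Series(data)
print("Pandas Series:\n", s)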

Pandas Series:  Series([], dtype: object)
Pandas Series:
 0    g
1    e
2    e
3    k
4    s
dtype: object


In [9]:
## Create labels:

import pandas as pd

a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
print(a[0])
print(myvar["y"])

x    1
y    7
z    2
dtype: int64
1
7


- Key/Value Objects as Series:

In [10]:
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)

day1    420
day2    380
day3    390
dtype: int64


In [11]:
## Create a Series using only data from "day1" and "day2":

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)

day1    420
day2    380
dtype: int64
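
The example above only selects labels that exist in the dictionary. A minimal sketch of what happens otherwise, assuming a made-up label "day4" that is not a key of `calories`:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is a hypothetical label that is not a key in the dictionary
myvar = pd.Series(calories, index=["day1", "day4"])

print(myvar)

# The missing label is kept, but its value is NaN.
# Because NaN is a float, the whole Series is stored as float64.
print(myvar.isna())
```

So a label passed to `index` is never silently dropped; it shows up as a missing value that can be detected with `isna()`.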


2. DataFrames:

Data sets in Pandas are usually two-dimensional tables, called DataFrames.

If a Series is like a column, a DataFrame is the whole table.
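
The column/table relationship can be checked directly. The sketch below reuses the calories/duration sample values that appear elsewhere in this notebook; the variable names `col` and `sub` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# Selecting one column with a single label returns a Series
col = df["calories"]
print(type(col).__name__)

# Selecting with a list of labels returns a (smaller) DataFrame
sub = df[["calories"]]
print(type(sub).__name__)
```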



In [15]:
## Create a DataFrame from a dictionary of lists (each list becomes a column):

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data)

print(df)

## Pandas uses the loc attribute to return one or more specified row(s)
print(df.loc[1])

   calories  duration
0       420        50
1       380        40
2       390        45
calories    380
duration     40
Name: 1, dtype: int64
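
The cell above returns a single row. As a small sketch of the "one or more" case, `loc` also accepts a list of labels, and a row label together with a column label:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# A single label returns that row as a Series
row = df.loc[1]

# A list of labels returns those rows as a DataFrame
rows = df.loc[[0, 1]]
print(rows)

# A row label plus a column label returns a single value
value = df.loc[2, "duration"]
print(value)
```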


With the index argument, you can assign your own row labels.

In [None]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df) 

print(df.loc["day2"])  ## Access a specific row

## operations on DataFrames:

print(df.head(2))  #prints first two rows
print(df.tail(2))  #prints last two rows
print(df.shape)  #prints number of rows and columns
print(df.columns)  #prints column names
print(df.size)  #prints total number of elements
print(df.dtypes)  #prints data types of each column
print(df.values)  #prints data as a numpy array
print(df.index)  #prints index labels
print(df['duration']) #prints specific column

      calories  duration
day1       420        50
day2       380        40
day3       390        45
calories    380
duration     40
Name: day2, dtype: int64
      calories  duration
day1       420        50
day2       380        40
      calories  duration
day2       380        40
day3       390        45
(3, 2)
Index(['calories', 'duration'], dtype='object')
6
calories    int64
duration    int64
dtype: object
[[420  50]
 [380  40]
 [390  45]]
Index(['day1', 'day2', 'day3'], dtype='object')
day1    50
day2    40
day3    45
Name: duration, dtype: int64


- Load Files Into a DataFrame:

### CSV FILE:

If your data sets are stored in a file, Pandas can load them into a DataFrame.

- StringIO:

StringIO converts a string into a file-like object so that pandas can read it as if it were a file.

Pandas functions such as read_csv() expect a file path or a file-like object.  
If the data is held in a string, it cannot be read directly, so StringIO is used to wrap it.

In [None]:
## Without StringIO: read_csv() treats the string as a file path, so this fails

import pandas as pd

data= "name, age\n" \
"      Amit, 25\n " \
"      Sneha, 23"

pd.read_csv(data)

OSError: [Errno 22] Invalid argument: 'name, age\n      Amit, 25\n       Sneha, 23'

In [None]:
## With StringIO

import pandas as pd
from io import StringIO

data="name, age\n" \
"      Amit, 25\n " \
"      Sneha, 23"

df=pd.read_csv(StringIO(data))
print(df)

           name   age
0          Amit    25
1         Sneha    23
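
In the output above, the spaces that follow each comma become part of the parsed text (the second column is actually named " age"). One possible way to avoid this is the `skipinitialspace` parameter of `read_csv()`. The sketch below assumes the same sample data written without the extra indentation:

```python
import pandas as pd
from io import StringIO

data = "name, age\nAmit, 25\nSneha, 23"

# Default parsing keeps the space after each comma
raw = pd.read_csv(StringIO(data))
print(list(raw.columns))

# skipinitialspace=True drops spaces that follow the delimiter
clean = pd.read_csv(StringIO(data), skipinitialspace=True)
print(list(clean.columns))
print(clean)
```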


In [2]:
##Load a comma separated file (CSV file) into a DataFrame:

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df) 
df = pd.read_csv("data.csv")
print(df)

      calories  duration
day1       420        50
day2       380        40
day3       390        45
Empty DataFrame
Columns: [hii hello]
Index: []


A simple way to store big data sets is to use CSV files (comma separated files).

CSV files contain plain text in a well-known format that can be read by almost any tool, including Pandas.
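
Because the contents of data.csv are not shown here, the following sketch writes its own small CSV file first and then reads it back. The temporary path and the file name "example.csv" are assumptions made only for this illustration:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

# Write to a temporary folder so the example does not touch real data
path = os.path.join(tempfile.mkdtemp(), "example.csv")
df.to_csv(path)

# index_col=0 tells read_csv to use the first column as the row labels
loaded = pd.read_csv(path, index_col=0)
print(loaded)
```

Without `index_col=0`, the row labels would come back as an ordinary unnamed column and a new 0, 1, 2 index would be added.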


In [None]:
## Load the CSV into a DataFrame:

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df) ## without to_string()
print(df.to_string()) ## with to_string()

df = pd.read_csv("data.csv")
print(df)


      calories  duration
day1       420        50
day2       380        40
day3       390        45
      calories  duration
day1       420        50
day2       380        40
day3       390        45
Empty DataFrame
Columns: [hii hello]
Index: []


- max_rows:

The number of rows displayed when a DataFrame is printed is defined in the Pandas option settings.

You can check the current maximum with the pd.options.display.max_rows statement.

In [25]:
##Increase the maximum number of rows to display the entire DataFrame:

import pandas as pd

pd.options.display.max_rows = 9999

df = pd.read_csv('data.csv')

print(df) 

Empty DataFrame
Columns: [hii hello]
Index: []
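
Setting `pd.options.display.max_rows` changes the limit for the rest of the session. As an alternative sketch, `pd.option_context` changes it only inside a `with` block. The 100-row DataFrame and the value 200 are arbitrary choices for the illustration:

```python
import pandas as pd

# The current display limit
before = pd.get_option("display.max_rows")
print(before)

df = pd.DataFrame({"n": range(100)})

# Temporarily raise the limit only inside this block
with pd.option_context("display.max_rows", 200):
    inside = pd.get_option("display.max_rows")
    long_text = repr(df)  # the text that print(df) would show

# Outside the block the previous setting is restored
after = pd.get_option("display.max_rows")
print(inside, after)
print(len(long_text.splitlines()))
```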


### Read JSON:

Large data sets are often stored or exchanged as JSON.

JSON is plain text with an object-like structure. It is widely used in programming, and Pandas can read it directly.

In [29]:
import pandas as pd

df = pd.read_json('data.json')
print(df)

print(df.to_string()) 

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300
   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300
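
The cell above depends on a data.json file on disk. A self-contained sketch of the same idea wraps JSON text in StringIO. The text below is a hand-written sample using the first three rows of two of the columns shown above, and its layout is an assumption about how such a file could look:

```python
import pandas as pd
from io import StringIO

# Hypothetical JSON text in the column-oriented layout
json_text = (
    '{"Duration": {"0": 60, "1": 60, "2": 60},'
    ' "Pulse": {"0": 110, "1": 117, "2": 103}}'
)

df = pd.read_json(StringIO(json_text))
print(df)
```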


### Dictionary as JSON:
If your JSON data is not in a file but already in a Python dictionary, you can load it into a DataFrame directly.

JSON objects look very similar to Python dictionaries. JSON itself is text, though, so it has to be parsed (for example with json.loads()) before it becomes a dictionary.

In [9]:
import pandas as pd

data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}

df = pd.DataFrame(data)

print(df) 

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300
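
One detail is easy to miss in the printed output: the row labels come from the inner dictionary keys, so they are the strings "0", "1", ... and not integers. A shortened sketch with two rows of the same data:

```python
import pandas as pd

data = {
    "Duration": {"0": 60, "1": 60},
    "Pulse": {"0": 110, "1": 117},
}

df = pd.DataFrame(data)

# The inner dictionary keys become the index, and they stay strings
labels = df.index.tolist()
print(labels)

# So a row is looked up with "0", not with the integer 0
print(df.loc["0", "Duration"])

# Optional: convert the labels to integers
df.index = df.index.astype(int)
print(df.loc[0, "Duration"])
```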


### Parameters:

### 1. orient

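Defines the layout that the JSON data is expected to have.

Orient Value | Meaning
-------------|---------------------------------------------------
'records'    | List of dictionaries, one per row
'columns'    | Dict keyed by column name: {column: {index: value}}
'index'      | Dict keyed by index label: {index: {column: value}}
'values'     | Array of values only (no labels)
'table'      | Table Schema: a schema plus the data records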
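
To see what each layout looks like, it can help to go the other way and write a small DataFrame out with `to_json()`. The sketch below uses a two-row name/age sample and only prints the JSON text; nothing is written to disk:

```python
import json

import pandas as pd

df = pd.DataFrame({"name": ["Amit", "Neha"], "age": [20, 21]})

# Serialize the same DataFrame with several orient values
for orient in ["records", "columns", "index", "values"]:
    print(orient, "->", df.to_json(orient=orient))

# json.loads turns the text back into Python objects for inspection
records = json.loads(df.to_json(orient="records"))
columns = json.loads(df.to_json(orient="columns"))
index = json.loads(df.to_json(orient="index"))
values = json.loads(df.to_json(orient="values"))
```

When reading, the `orient` passed to `read_json()` should match the layout the file was written in.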

In [5]:
pd.read_json("data.json", orient="records")


   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300


In [6]:
pd.read_json("data.json", orient="columns")


   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300


In [7]:
pd.read_json("data.json", orient="index")


            0    1    2    3    4    5
Duration   60   60   60   45   45   60
Pulse     110  117  103  109  117  102
Maxpulse  130  145  135  175  148  127
Calories  409  479  340  282  406  300


In [8]:
pd.read_json("data.json", orient="values")


   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300


In [4]:
import pandas as pd

pd.read_json("file.json", orient="table")

      age
name
Amit   20
Neha   21


In [None]:
df = pd.read_json('data.json')
print(df)

## to_json() converts Pandas data into JSON format.
## The output name below is only an example; a .json name avoids overwriting data.csv.
df.to_json("data_index.json", orient="index")

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300


In [17]:
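## Note: json_normalize() expects a dict or a list of dicts (nested JSON), not a DataFrame.
## Passing a DataFrame, as here, does not flatten anything useful.
pd.json_normalize(df)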

0
1
2
3
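
For comparison, here is a sketch of `json_normalize()` applied to the kind of input it is meant for: a list of nested dictionaries. The records below, including the "info" key and the city values, are invented for illustration:

```python
import pandas as pd

# Hypothetical nested records, as they might come from json.loads()
records = [
    {"name": "Amit", "info": {"age": 20, "city": "Pune"}},
    {"name": "Neha", "info": {"age": 21, "city": "Mumbai"}},
]

# Nested keys are flattened into columns joined with "."
flat = pd.json_normalize(records)
print(flat.columns.tolist())
print(flat)
```

Each nested dictionary becomes a set of columns named after the path to the value, such as "info.age".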
