# pandas (Python)

pandas is a Python library for data analysis.
It provides data structures and functions for cleaning, exploring, analyzing, and manipulating data, designed to make that work fast and easy in Python.
pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib.
pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops.

While pandas adopts many coding idioms from NumPy, the biggest difference is that
pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast,
is best suited for working with homogeneous numerical array data.
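
The contrast can be illustrated with a small sketch. The column names and values below are made up for illustration: the NumPy array has one dtype for all of its elements, while each DataFrame column keeps its own dtype.

```python
import numpy as np
import pandas as pd

# A NumPy array stores a single dtype for every element.
arr = np.array([[1.5, 2.0], [3.5, 4.0]])
print(arr.dtype)

# A DataFrame can store a different dtype in each column.
# These names and values are hypothetical sample data.
df = pd.DataFrame({
    "name": ["a", "b"],          # text
    "score": [1.5, 2.5],         # floating point
    "passed": [True, False],     # boolean
})
print(df.dtypes)
```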

In [None]:
Why Use pandas?
pandas lets us analyze large data sets and draw conclusions based on statistical theory.
pandas can clean messy data sets and make them readable and relevant.
Relevant data is very important in data science.

In [None]:
What Can Pandas Do?
pandas gives you answers about the data, such as:
Is there a correlation between two or more columns?
What is the average value?
What is the maximum value?
What is the minimum value?
pandas can also delete rows that are not relevant or that contain wrong values,
such as empty or NULL values. This is called cleaning the data.
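
As a rough sketch of these questions in code, the example below uses a small made-up DataFrame (the `hours` and `score` columns are assumptions, not data from this notebook) and answers each question with a standard pandas method. `dropna` is one common way to remove rows that contain missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; the last row has a missing value (NaN).
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, np.nan],
    "score": [10.0, 20.0, 30.0, 40.0, 50.0],
})

print(df["score"].mean())   # average value
print(df["score"].max())    # maximum value
print(df["score"].min())    # minimum value

# Correlation between two columns; pairs containing NaN are ignored.
print(df["hours"].corr(df["score"]))

# Cleaning: drop every row that contains at least one missing value.
cleaned = df.dropna()
print(cleaned)
```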

In [1]:
# Import pandas
import pandas as pd
# Now pandas is imported and ready to use.
from pandas import Series, DataFrame

# Introduction to pandas Data Structures

Series: A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.
A pandas Series is like a column in a table. It is a one-dimensional array that can hold data of any type.

In [2]:
#Create a simple Pandas Series from a list:
ser1=pd.Series([4,7,1,-3])
print(ser1)

0    4
1    7
2    1
3   -3
dtype: int64


In [4]:
#OR
a=[1,2,3,-3]
ser2=pd.Series(a)
print(ser2)

0    1
1    2
2    3
3   -3
dtype: int64


In [5]:
print(ser2.values)

[ 1  2  3 -3]


In [6]:
print(ser2.index)

RangeIndex(start=0, stop=4, step=1)


In [9]:
#If nothing else is specified, the values are labeled with their index number. 
#First value has index 0, second value has index 1 etc.
print(ser2[3])



-3


In [11]:
#Create Labels
# With the index argument, you can create your own labels.
ser1=pd.Series([4,7,1,-3],index=['c','a','b','d'])
print(ser1)

c    4
a    7
b    1
d   -3
dtype: int64


In [13]:
print(ser1['a'])
ser1['a']=6
print(ser1)
print(ser1[['c','a','d']])

6
c    4
a    6
b    1
d   -3
dtype: int64
c    4
a    6
d   -3
dtype: int64


In [14]:
# Some operations
# Boolean filtering: keep only the values greater than 0
ser1[ser1>0]



c    4
a    6
b    1
dtype: int64

In [15]:
#multiplication
ser1*2

c     8
a    12
b     2
d    -6
dtype: int64

In [16]:
import numpy as np
np.exp(ser1)

c     54.598150
a    403.428793
b      2.718282
d      0.049787
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping
of index values to data values. 
We can also use a key/value object, like a dictionary, when creating a Series.

In [17]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print(obj3)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


In [20]:
# Another Series built from a dict: salary keyed by name
data = {'Ram': 55000, 'Shyam': 81000, 'JJ': 130000}
obj3 = pd.Series(data)
print(obj3)

Ram       55000
Shyam     81000
JJ       130000
dtype: int64
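
The Series above has no name yet. A minimal sketch of naming it `salary` uses the `name` attribute; the index label `employee` is a hypothetical choice for illustration.

```python
import pandas as pd

data = {'Ram': 55000, 'Shyam': 81000, 'JJ': 130000}
obj3 = pd.Series(data)

# Give the Series itself a name.
obj3.name = 'salary'
# The index can carry its own name too ('employee' is an assumed label).
obj3.index.name = 'employee'

print(obj3)
```

Once set, the name appears in the footer of the printed Series, next to the dtype.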


In [21]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)  # 'California' is not in sdata, so its value is NaN (missing)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [14]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
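
The same check is also available as a method on the Series, alongside its inverse and a way to drop the missing entries. The sketch below rebuilds `obj4` so that it runs on its own.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# Method form of pd.isnull(obj4): True where the value is missing.
print(obj4.isnull())

# Inverse check: True where a value is present.
print(obj4.notnull())

# Return a new Series without the missing entries.
print(obj4.dropna())
```

Note that 'Utah' is in `sdata` but not in `states`, so it is left out of `obj4` entirely.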

In [16]:
# DataFrame

In [None]:
Data sets in pandas are usually multi-dimensional tables, called DataFrames.
A Series is like a column; a DataFrame is the whole table.

A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string, boolean, etc.).
The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index.
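
The "dict of Series" view can be sketched directly. The variable names `year`, `pop`, and `small` below are illustrative assumptions: two Series with the same index are combined into a DataFrame, and each column comes back out as a Series sharing that index.

```python
import pandas as pd

# Two Series that share the same index labels.
year = pd.Series([2000, 2001], index=['Ohio', 'Nevada'])
pop = pd.Series([1.5, 2.4], index=['Ohio', 'Nevada'])

# A dict of Series becomes a DataFrame; dict keys become column names.
small = pd.DataFrame({'year': year, 'pop': pop})
print(small)

# Each column is a Series with the DataFrame's row index.
col = small['pop']
print(type(col).__name__)
print(col.index.equals(small.index))
```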

In [22]:
import pandas as pd
from pandas import DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In [23]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [24]:
#For large DataFrames, the head method selects only the first five rows
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [25]:
# print first 3 rows
frame.head(3)



  state  year  pop
0  Ohio  2000  1.5
1  Ohio  2001  1.7
2  Ohio  2002  3.6


In [26]:
# The tail method selects the last five rows
frame.tail()



    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [28]:
# Select columns and set their order with the columns argument
pd.DataFrame(data,columns=['year','pop'])

   year  pop
0  2000  1.5
1  2001  1.7
2  2002  3.6
3  2001  2.4
4  2002  2.9
5  2003  3.2


In [29]:
# A column name that is not in data ('debt') is created with missing values
frame2=pd.DataFrame(data,columns=['year','pop','debt'])
frame2

   year  pop debt
0  2000  1.5  NaN
1  2001  1.7  NaN
2  2002  3.6  NaN
3  2001  2.4  NaN
4  2002  2.9  NaN
5  2003  3.2  NaN


In [31]:
# print all columns
print(frame2.columns)

Index(['year', 'pop', 'debt'], dtype='object')


In [32]:
# print data of year column
frame2['year']


0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

In [27]:
# loc and iloc attributes of a DataFrame
# Rows can be retrieved by label with the loc attribute, or by integer position with iloc
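
A minimal sketch of row retrieval follows. It assumes a custom row index ('one' through 'six', hypothetical labels that the notebook's `frame2` does not have) so that label-based and position-based access are easy to tell apart: `loc` selects by label and `iloc` selects by integer position.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

# Hypothetical row labels, added only for this illustration.
labels = ['one', 'two', 'three', 'four', 'five', 'six']
frame2 = pd.DataFrame(data, columns=['year', 'pop'], index=labels)

# Select a row by its label.
row_by_label = frame2.loc['three']
print(row_by_label)

# Select the same row by its integer position (0-based).
row_by_pos = frame2.iloc[2]
print(row_by_pos)

# loc also accepts a (row label, column label) pair.
print(frame2.loc['three', 'pop'])
```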

In [36]:
frame2['debt']=np.arange(6)
frame2

   year  pop  debt
0  2000  1.5     0
1  2001  1.7     1
2  2002  3.6     2
3  2001  2.4     3
4  2002  2.9     4
5  2003  3.2     5
