# pandas (Python)

pandas is a Python library for data analysis.
It provides data structures and functions for cleaning, exploring, analyzing, and manipulating data, designed to make that work fast and easy in Python.
pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib.
pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops.

While pandas adopts many coding idioms from NumPy, the biggest difference is that
pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast,
is best suited for working with homogeneous numerical array data.
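
As a rough illustration of that difference, the sketch below puts mixed values into a NumPy array and into a DataFrame. The column names and values are invented for the example and are not taken from any real data set.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype; mixing numbers and strings
# forces every element to become a string.
arr = np.array([1, 2.5, "three"])
print(arr.dtype.kind)  # 'U' means a Unicode string dtype

# A DataFrame keeps a separate dtype for each column.
# The column names and values below are made up for illustration.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune"],
    "population_millions": [0.7, 10.0, 7.2],
    "is_capital": [True, True, False],
})
print(df.dtypes)
```

Each column of the DataFrame behaves like a homogeneous array, while the table as a whole can mix text, numbers, and booleans.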

The name "pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

## Why Use pandas?
pandas lets us analyze large data sets and draw conclusions based on statistical theory.

pandas can clean messy data sets and make them readable and relevant.
Relevant data is very important in data science.

## What Can pandas Do?
pandas gives you answers about the data, such as:
Is there a correlation between two or more columns?
What is the average value?
What is the maximum value?
What is the minimum value?
pandas can also delete rows that are not relevant or that contain wrong values,
such as empty or NULL values. This is called cleaning the data.
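
The questions above map onto a handful of pandas methods. The following is a minimal sketch on a small, hypothetical table; the column names `hours` and `score` and all values are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; np.nan marks a missing value.
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, np.nan],
    "score": [52.0, 61.0, 70.0, 79.0, 88.0],
})

print(df["hours"].corr(df["score"]))  # Pearson correlation between two columns
print(df["score"].mean())             # average value
print(df["score"].max())              # maximum value
print(df["score"].min())              # minimum value

# Cleaning: drop every row that contains a missing value.
cleaned = df.dropna()
print(len(df), len(cleaned))
```

`dropna()` returns a new DataFrame and leaves `df` unchanged, so the original data stays available for comparison.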

In [2]:
# Import pandas under its conventional alias; it is now ready to use.
import pandas as pd
# Series and DataFrame are also imported directly for convenience.
from pandas import Series, DataFrame

# Introduction to pandas Data Structures

Series: A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.
A pandas Series is like a column in a table. It is a one-dimensional array that can hold data of any type.

In [18]:
#Create a simple Pandas Series from a list:
obj = pd.Series([4, 7, -5, 3, 7, 10])
print(obj)

0     4
1     7
2    -5
3     3
4     7
5    10
dtype: int64


In [19]:
# Or build the Series from an existing list:
a = [4, 7, -5, 3]
obj = pd.Series(a)
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [20]:
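print(obj.values)
print(obj.index)  # like range(4)
# If no index is specified, the values are labeled by position:
# the first value has label 0, the second has label 1, and so on.
print(obj[0])  # label 0: the first value of the Series
print(obj[3])  # label 3: the last value
# obj[-1] and obj[5] are label lookups on this Series. No such labels
# exist, so they raise KeyError (for example "KeyError: -1").
# Use .iloc for positional access instead:
print(obj.iloc[-1])  # last value by position

[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)
4
3
3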
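
With the default integer index, square brackets look up labels rather than positions, which is why a negative number fails. A minimal sketch of the two explicit accessors, `.loc` for labels and `.iloc` for positions, using the same four values:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

first_by_label = obj.loc[0]      # the value whose label is 0
last_by_position = obj.iloc[-1]  # the last value, counted by position
print(first_by_label, last_by_position)

# A plain obj[-1] is a label lookup on this Series and raises KeyError.
try:
    obj[-1]
except KeyError as exc:
    print("KeyError:", exc)
```

Using `.loc` and `.iloc` makes the intent explicit, which helps once a Series carries an integer index that no longer matches the positions.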

In [22]:
#Create Labels
# With the index argument, you can create your own labels.
obj2 = pd.Series([4, 7, -5, 3, 10], index=['a', 'b', 'c', 'd', 'e'])
print(obj2)

a     4
b     7
c    -5
d     3
e    10
dtype: int64


In [23]:
print(obj2)
print(obj2['b'])  # select a single value by its label
obj2['e'] = 6     # assign a new value to an existing label
print(obj2)
print(obj2[['e', 'b', 'a']])  # select several labels at once with a list

a     4
b     7
c    -5
d     3
e    10
dtype: int64
7
a    4
b    7
c   -5
d    3
e    6
dtype: int64
e    6
b    7
a    4
dtype: int64


In [25]:
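# Some operations
print(obj2)
# obj2[obj2 < 5] would keep only the values of obj2 that are below 5.
# Comparisons are element-wise. Note that this one uses obj,
# the 4-element Series created in In [19], not obj2.
print(obj < 5)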

a    4
b    7
c   -5
d    3
e    6
dtype: int64
0     True
1    False
2     True
3     True
dtype: bool
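
A comparison like the one above yields a boolean Series, and passing that Series back inside square brackets filters the data. A minimal sketch, rebuilding `obj2` with the values it holds at this point so the block runs on its own:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3, 6], index=["a", "b", "c", "d", "e"])

mask = obj2 < 5     # boolean Series with the same index as obj2
small = obj2[mask]  # keeps only the entries where the mask is True
print(small)
```

The index labels travel with the values, so the filtered result still shows which entries survived.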


In [26]:
obj2 * 2

a     8
b    14
c   -10
d     6
e    12
dtype: int64

In [27]:
import numpy as np
np.exp(obj2)

a      54.598150
b    1096.633158
c       0.006738
d      20.085537
e     403.428793
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping
of index values to data values. 
We can also use a key/value object, like a dictionary, when creating a Series.

In [29]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print(obj3)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
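
Because a Series maps index labels to values, it can be used in several places where a dict would be. The sketch below shows membership tests and conversion back to a dict, reusing the same `sdata`:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Membership tests check the index labels, as with dict keys.
has_texas = "Texas" in obj3
has_california = "California" in obj3
print(has_texas, has_california)

# to_dict() converts the Series back to a plain dict.
round_trip = obj3.to_dict()
print(round_trip == sdata)
```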


In [32]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
# 'California' has no entry in sdata, so its value is NaN (missing / NA).
# 'Utah' is left out because it is not in states.
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [33]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
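
Once missing values are detected, they are usually counted, dropped, or replaced. A minimal sketch using the same `obj4`; the fill value `0` is an arbitrary choice for illustration, not a recommendation for real data.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

n_missing = obj4.isna().sum()  # number of missing entries
present = obj4.notna()         # inverse of isna(): True where a value exists
dropped = obj4.dropna()        # remove the missing entries
filled = obj4.fillna(0)        # or replace them; 0 is arbitrary here

print(n_missing)
print(present)
print(dropped)
print(filled)
```

`isna()` and `notna()` are the method forms of `pd.isnull` and `pd.notnull`. Both spellings return the same boolean Series.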