# PANDAS

**INTRODUCTION**  
**Pandas is an open-source Python library that provides easy-to-use data structures for data analysis.**
**Pandas is great for data manipulation, data analysis, and data visualization.**

#### WHY PANDAS?

1. We can easily read from and write to CSV files, or even databases.
2. Easy handling of missing data (represented as NaN) in floating-point as well as non-floating-point data.
3. We can manipulate data by column: columns can be inserted into and deleted from DataFrames and higher-dimensional objects.
4. Intuitive merging and joining of data sets.
5. Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
6. Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving data back out.
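
As a small illustration of the first two points, the sketch below round-trips a table through CSV text held in memory. The column names and values are made up for the example, and `io.StringIO` stands in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; the empty field is parsed as NaN.
csv_text = "name,score\nann,1.5\nbob,\ncid,3.0\n"

df_csv = pd.read_csv(io.StringIO(csv_text))
missing = int(df_csv["score"].isna().sum())  # number of missing scores

# Write the table back out as CSV text, without the index column.
buffer = io.StringIO()
df_csv.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

print(missing)
print(round_trip.equals(df_csv))
```

With a real file, the same calls would take a path instead of the in-memory buffer.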



In [0]:
import pandas as pd
import numpy as np

# Series
A Series is a one-dimensional data structure. It is essentially a labelled array that can hold data of different types:
* int
* float
* String
* Python object
* many more

The values are laid out one per row, each aligned with its index label.
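
To make the point about data types concrete, here is a minimal sketch with three small series. The variable names and values are only for illustration.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
floats = pd.Series([1.5, 2.5, 3.5])
mixed = pd.Series([1, "two", 3.0])  # arbitrary Python objects

print(ints.dtype)    # int64
print(floats.dtype)  # float64
print(mixed.dtype)   # object
```

A series that mixes types falls back to the generic `object` dtype, which is flexible but slower than a numeric dtype.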

### Creating a series from random data

In [0]:
# Generate random data for the series
data_for_series = np.random.randint(0, 100, size=(10,))
print(data_for_series)

[47 29 65 76 10 45 82  1 88 12]


In [0]:
s = pd.Series(data_for_series)

In [0]:
print(s)

0    47
1    29
2    65
3    76
4    10
5    45
6    82
7     1
8    88
9    12
dtype: int64


We can get summary statistics for the series with **describe()**:

In [0]:
s.describe()

count    10.000000
mean     45.500000
std      31.774378
min       1.000000
25%      16.250000
50%      46.000000
75%      73.250000
max      88.000000
dtype: float64
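
Because the series above is random, the numbers change on every run. The sketch below hard-codes the same ten values so the statistics are reproducible, and reads individual entries out of the result, which is itself a series.

```python
import pandas as pd

# Same values as the series printed above, hard-coded for reproducibility.
values = [47, 29, 65, 76, 10, 45, 82, 1, 88, 12]
stats = pd.Series(values).describe()

print(stats["mean"])  # 45.5
print(stats["50%"])   # 46.0 (the median)
print(stats["max"])   # 88.0
```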

### Accessing the data

Using **head()**, we can view the data from the top. If no argument is passed, **head()** returns the first 5 rows.

In [0]:
s.head()

0    47
1    29
2    65
3    76
4    10
dtype: int64

In [0]:
s.head(3)

0    47
1    29
2    65
dtype: int64

Similarly, the **tail()** function returns rows from the bottom of the data.

In [0]:
s.tail()

5    45
6    82
7     1
8    88
9    12
dtype: int64

In [0]:
s.tail(3)

7     1
8    88
9    12
dtype: int64
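
Both functions also accept a negative argument, which means "everything except". A short sketch, using the same ten values hard-coded under the illustrative name `s_demo`:

```python
import pandas as pd

s_demo = pd.Series([47, 29, 65, 76, 10, 45, 82, 1, 88, 12])

all_but_last_3 = s_demo.head(-3)   # positions 0 to 6
all_but_first_3 = s_demo.tail(-3)  # positions 3 to 9

print(len(all_but_last_3), len(all_but_first_3))  # 7 7
```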

#### Using **loc** and **iloc**  
* **loc** gets rows by their ***label*** in the index.
* **iloc** gets rows by their integer ***position*** in the index. (Note: iloc only accepts integer positions.)

In [0]:
s.loc[2]

65

In [0]:
s.iloc[2]

65
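
With the default index, labels and positions coincide, so both calls return the same value. The difference shows up as soon as the index is something else. A minimal sketch with a made-up integer index:

```python
import pandas as pd

# Labels 10, 20, 30 sit at positions 0, 1, 2.
s_shifted = pd.Series([47, 29, 65], index=[10, 20, 30])

by_label = s_shifted.loc[20]     # the value whose label is 20
by_position = s_shifted.iloc[2]  # the third value

print(by_label, by_position)  # 29 65
```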

Let's repeat this for a series with a custom index.

In [0]:
index_for_series = 'A B C D E F G H I J'.split()

In [0]:
print(index_for_series)

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']


In [0]:
s_c = pd.Series(data_for_series, index=index_for_series)

In [0]:
print(s_c)

A    47
B    29
C    65
D    76
E    10
F    45
G    82
H     1
I    88
J    12
dtype: int64


In [0]:
s_c.head()

A    47
B    29
C    65
D    76
E    10
dtype: int64

In [0]:
s_c.head(3)

A    47
B    29
C    65
dtype: int64

In [0]:
s_c.tail()

F    45
G    82
H     1
I    88
J    12
dtype: int64

In [0]:
s_c.tail(3)

H     1
I    88
J    12
dtype: int64

In [0]:
s_c.loc['C']

65

In [0]:
s_c.iloc[2]

65

In [0]:
s_c.loc[['C','D']]

C    65
D    76
dtype: int64

In [0]:
s_c.iloc[2:4]

C    65
D    76
dtype: int64
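
One detail worth noting when slicing: a label slice with **loc** includes its end label, while a position slice with **iloc** excludes its end position. A small sketch, with a shortened series under the illustrative name `s_lbl`:

```python
import pandas as pd

s_lbl = pd.Series([47, 29, 65, 76, 10], index=list("ABCDE"))

label_slice = s_lbl.loc["C":"D"]  # includes both C and D
position_slice = s_lbl.iloc[2:4]  # positions 2 and 3; position 4 is excluded

print(label_slice.equals(position_slice))  # True
```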

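We get an error if we try either of the following: **loc** raises a `KeyError` because 1 is not a label in the index, and **iloc** raises a `TypeError` because 'A' is not an integer position.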

```python
s_c.loc[1]
s_c.iloc['A']
```
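
To see which exception each call raises without stopping the notebook, we can wrap them in `try`/`except`. This is a sketch on a shortened series; `s_err` is an illustrative name.

```python
import pandas as pd

s_err = pd.Series([47, 29, 65], index=["A", "B", "C"])

loc_error = None
iloc_error = None

try:
    s_err.loc[1]  # 1 is not a label in the index
except KeyError as exc:
    loc_error = type(exc).__name__

try:
    s_err.iloc["A"]  # iloc accepts positions, not labels
except TypeError as exc:
    iloc_error = type(exc).__name__

print(loc_error, iloc_error)  # KeyError TypeError
```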

#### Slicing the series

In [0]:
s[0:3]

0    47
1    29
2    65
dtype: int64

In [0]:
s_c[0:3]

A    47
B    29
C    65
dtype: int64

#### Creating series from dictionary

In [0]:
data_dict = {
    'A':1,
    'B':100,
    'C':12,
    'D':14,
    'E':155,
    'F':22,
    'G':123,
    'H':21,
    'I':51,
    'J':74,
}

In [0]:
s_dict = pd.Series(data_dict)

In [0]:
s_dict

A      1
B    100
C     12
D     14
E    155
F     22
G    123
H     21
I     51
J     74
dtype: int64
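
When a dictionary is combined with an explicit `index`, pandas picks the values by key in the order given, and a key that is missing from the dictionary becomes NaN. A minimal sketch with a shortened dictionary (the key `'Z'` is deliberately absent):

```python
import pandas as pd

data_small = {"A": 1, "B": 100, "C": 12}
s_sel = pd.Series(data_small, index=["C", "A", "Z"])

print(s_sel)
```

Because NaN is a floating-point value, the resulting series has dtype `float64` rather than `int64`.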

## DATAFRAME

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A pandas DataFrame can be created using the following constructor:

```python
pandas.DataFrame(data, index, columns, dtype)
```
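
Here is a small sketch that passes all four of these arguments by keyword. The row labels, column names, and values are made up for the example.

```python
import pandas as pd

df_ctor = pd.DataFrame(
    data=[[1, 2], [3, 4]],  # one inner list per row
    index=["r1", "r2"],     # row labels
    columns=["x", "y"],     # column labels
    dtype="float64",        # force every column to this dtype
)

print(df_ctor)
```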

# Creating a DataFrame from a list of lists
data = [['Harry', 15], ['John', 14], ['Andrew', 13]]
df = pd.DataFrame(data, columns=['Name','Age'])
df
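
The same table can also be built from a dictionary, where each key is a column name and each value is the list of entries for that column. A sketch comparing the two forms (the variable names are illustrative):

```python
import pandas as pd

# Dictionary form: one key per column.
data_by_column = {"Name": ["Harry", "John", "Andrew"], "Age": [15, 14, 13]}
df_from_dict = pd.DataFrame(data_by_column)

# List-of-lists form: one inner list per row, with column names given separately.
data_by_row = [["Harry", 15], ["John", 14], ["Andrew", 13]]
df_from_list = pd.DataFrame(data_by_row, columns=["Name", "Age"])

print(df_from_dict.to_dict("list") == df_from_list.to_dict("list"))  # True
```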

# Creating random data

In [0]:
data = np.random.randint(0, 10, (5, 4))  # integers from 0 to 9 in a 5x4 matrix

In [0]:
print(data)

[[0 3 0 0]
 [6 3 9 2]
 [4 5 8 1]
 [4 5 9 1]
 [0 2 5 1]]


In [0]:
#creating dataframe from random numbers
my_index = '1 2 3 4 5'.split()
print(my_index)
df = pd.DataFrame(data,index=my_index,columns='A B C D'.split())
df

['1', '2', '3', '4', '5']


   A  B  C  D
1  0  3  0  0
2  6  3  9  2
3  4  5  8  1
4  4  5  9  1
5  0  2  5  1


### Checking the information

In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 1 to 5
Data columns (total 4 columns):
A    5 non-null int64
B    5 non-null int64
C    5 non-null int64
D    5 non-null int64
dtypes: int64(4)
memory usage: 200.0+ bytes


In [0]:
df.describe()

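              A         B         C         D
count  5.000000  5.000000  5.000000  5.000000
mean   2.800000  3.600000  6.200000  1.000000
std    2.683282  1.341641  3.834058  0.707107
min    0.000000  2.000000  0.000000  0.000000
25%    0.000000  3.000000  5.000000  1.000000
50%    4.000000  3.000000  8.000000  1.000000
75%    4.000000  5.000000  9.000000  1.000000
max    6.000000  5.000000  9.000000  2.000000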


# Indexing columns

In [0]:
df[['B','D']]  # a single column can also be accessed as df['B'] or df.B

   B  D
1  3  0
2  3  2
3  5  1
4  5  1
5  2  1


In [0]:
type(df)

pandas.core.frame.DataFrame
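
The type of the result depends on how the columns are requested. A minimal sketch, using a small stand-in frame named `df_cols`:

```python
import pandas as pd

df_cols = pd.DataFrame({"A": [0, 6], "B": [3, 3], "D": [0, 2]})

one_column = df_cols["B"]          # a single label returns a Series
two_columns = df_cols[["B", "D"]]  # a list of labels returns a DataFrame
as_frame = df_cols[["B"]]          # a one-item list still returns a DataFrame

print(type(one_column).__name__, type(two_columns).__name__, type(as_frame).__name__)
```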

#### Using iloc and loc
Each pair of cells below selects the same data, first by position with iloc and then by label with loc.

In [0]:
df.iloc[0]

A    0
B    3
C    0
D    0
Name: 1, dtype: int64

In [0]:
df.loc['1']

A    0
B    3
C    0
D    0
Name: 1, dtype: int64

In [0]:
df.iloc[:, [0]]

   A
1  0
2  6
3  4
4  4
5  0


In [0]:
df.loc[:, ['A']]

In [0]:
df.iloc[:, [1, 3]]

In [0]:
df.loc[:, ['B','D']]

In [0]:
df.iloc[1, [1, 3]]

In [0]:
df.loc['2', ['B', 'D']]

In [0]:
df.iloc[1:3, [1,3]]

In [0]:
df.loc[['2', '3'], ['B','D']]

In [0]:
df.iloc[[1, 3, 4],[1, 3]]

In [0]:
df.loc[['2','4','5'], ['B','D']]
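
We can check these equivalences programmatically. The sketch below rebuilds the frame from the values printed earlier (under the illustrative name `df_eq`) and compares each position-based selection with its label-based counterpart. Note that position 1 corresponds to the row labelled `'2'`.

```python
import numpy as np
import pandas as pd

values = np.array([[0, 3, 0, 0], [6, 3, 9, 2], [4, 5, 8, 1], [4, 5, 9, 1], [0, 2, 5, 1]])
df_eq = pd.DataFrame(values, index=list("12345"), columns=list("ABCD"))

# (selection by position, selection by label)
pairs = [
    (df_eq.iloc[0], df_eq.loc["1"]),
    (df_eq.iloc[:, [1, 3]], df_eq.loc[:, ["B", "D"]]),
    (df_eq.iloc[1, [1, 3]], df_eq.loc["2", ["B", "D"]]),
    (df_eq.iloc[1:3, [1, 3]], df_eq.loc[["2", "3"], ["B", "D"]]),
    (df_eq.iloc[[1, 3, 4], [1, 3]], df_eq.loc[["2", "4", "5"], ["B", "D"]]),
]
all_equal = all(by_pos.equals(by_label) for by_pos, by_label in pairs)

print(all_equal)  # True
```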

#### Combine two columns

In [0]:
df['Sum'] = df['A'] + df['B']
df

In [0]:
df_copy = df.copy()  # independent copy, kept so we can restore the 'Sum' column later
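
A common way to take an independent snapshot of a frame is `DataFrame.copy()`, which makes a deep copy by default. The sketch below, on a small stand-in frame, shows that later changes to the original do not reach the snapshot.

```python
import pandas as pd

original = pd.DataFrame({"A": [0, 6], "B": [3, 3]})
snapshot = original.copy()  # deep copy by default

# Modify the original after the snapshot was taken.
original["Sum"] = original["A"] + original["B"]

print(list(original.columns))  # ['A', 'B', 'Sum']
print(list(snapshot.columns))  # ['A', 'B']
```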

#### Dropping a column

In [0]:
df.drop('Sum',axis=1) #Column drop axis=1

The column has not been dropped from the original DataFrame: **drop()** returns a new DataFrame and leaves `df` unchanged. To modify `df` itself, we set the ***inplace*** flag.

In [0]:
df.drop('Sum', axis=1, inplace=True)

In [0]:
df

In [0]:
df.drop('1', axis=0)

In [0]:
df

In [0]:
df.drop('1', axis=0, inplace=True)

In [0]:
df
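
As a side note, **drop()** also accepts the `columns=` and `index=` keywords, which some find easier to read than `axis=1` and `axis=0`. A minimal sketch on a small stand-in frame, also showing that the original is left untouched when `inplace` is not used:

```python
import pandas as pd

df_drop = pd.DataFrame({"A": [0, 6], "B": [3, 3], "Sum": [3, 9]}, index=["1", "2"])

without_sum = df_drop.drop(columns="Sum")  # same as drop("Sum", axis=1)
without_row = df_drop.drop(index="1")      # same as drop("1", axis=0)

print(list(without_sum.columns))  # ['A', 'B']
print(list(without_row.index))    # ['2']
print(list(df_drop.columns))      # ['A', 'B', 'Sum']
```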

Alternatively, we can assign the result of **drop()** back to the variable instead of using **inplace**.

In [0]:
# Revert the dataframe with the 'Sum' column

df = df_copy.copy()

In [0]:
df

In [0]:
df_copy = df.copy()

In [0]:
df = df.drop('Sum',axis=1) #Column drop axis=1

In [0]:
df_copy

In [0]:
df

In [0]:
df = df.drop('1',axis=0) #row drop axis=0
df

In [0]:
df.loc['4']

In [0]:
# selecting particular cell
df.loc['2','D']

In [0]:
# selecting a particular cell by integer position

df.iloc[2,3]
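
For single cells, pandas also offers the scalar accessors `at` (by label) and `iat` (by position). The sketch below rebuilds the four remaining rows from the values printed earlier, under the illustrative name `df_cell`.

```python
import pandas as pd

df_cell = pd.DataFrame(
    [[6, 3, 9, 2], [4, 5, 8, 1], [4, 5, 9, 1], [0, 2, 5, 1]],
    index=list("2345"),
    columns=list("ABCD"),
)

by_label = df_cell.at["2", "D"]  # row labelled '2', column 'D'
by_position = df_cell.iat[2, 3]  # third row (label '4'), fourth column ('D')

print(by_label, by_position)  # 2 1
```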

In [0]:
df.loc[['3','2'],['A','B']]

In [0]:
df

In [0]:
df>0

In [0]:
df[df>5] = 'Changed'

In [0]:
df
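
A comparison such as `df > 5` produces a boolean DataFrame of the same shape, which can then be used to select or replace values. As an alternative sketch to assigning through the mask, `DataFrame.mask()` returns a new frame with NaN wherever the condition is true. The frame `df_mask` below is a small stand-in.

```python
import pandas as pd

df_mask = pd.DataFrame({"A": [6, 4, 0], "C": [9, 8, 5]}, index=list("235"))

over_five = df_mask > 5                # boolean frame, True where the value exceeds 5
masked = df_mask.mask(over_five)       # NaN where the condition is True
n_hidden = int(over_five.sum().sum())  # how many cells matched

print(n_hidden)  # 3
print(masked)
```

The original frame is left unchanged, which avoids mixing strings into numeric columns.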

### Filling NaN with some value

In [0]:
import numpy as np

In [0]:
df[df=='Changed'] = np.nan

In [0]:
df

In [0]:
df['D'].mean()

In [0]:
df.fillna(df['D'].mean())

In [0]:
df

To keep the changes, we need to use **inplace**:

In [0]:
df.fillna(df['D'].mean(), inplace=True)

In [0]:
df
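
Filling every gap with the mean of one column is a simple choice. A variation, sketched below on a small stand-in frame, passes `df.mean()` so that each column is filled with its own mean. The result can also be assigned to a new variable instead of using **inplace**.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({"A": [np.nan, 4.0, 0.0], "D": [2.0, np.nan, 1.0]})

# df_nan.mean() is a Series of column means (A=2.0, D=1.5);
# fillna matches it to the columns by label.
filled = df_nan.fillna(df_nan.mean())

print(filled)
```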