# Pandas-Quick-Guide




**Pandas lets you do things like:**

- Calculate statistics and answer questions about the data, such as:
  - What's the average, median, max, or min of each column?
  - Does column A correlate with column B?
  - What does the distribution of data in column C look like?
- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
- Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
- Store the cleaned, transformed data back into a CSV, another file, or a database
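
A few of the tasks listed above can be sketched in a handful of lines. The frame below is a hypothetical sample with one missing value, not the dataset used later in this guide, and the `io.StringIO` buffer only stands in for a real CSV file path.

```python
import io

import pandas as pd

# Hypothetical sample with one missing file size (not the guide's dataset)
sample = pd.DataFrame({
    "file_name": ["a", "b", "c", "d", "e"],
    "file_size": [380, 358, None, 331, 329],
})

# Clean: drop rows that contain missing values
cleaned = sample.dropna()

# Statistics on a single column
print(cleaned["file_size"].mean())
print(cleaned["file_size"].median())

# Store: write the cleaned data as CSV text (the buffer stands in for a file path)
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path to `to_csv` instead of the buffer would write the same text to disk.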


*Web Links:*

https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/



In [2]:
#Import libraries

import pandas as pd # pandas is a dataframe library
import matplotlib.pyplot as plt # matplotlib.pyplot plots data
import numpy as np # numpy provides N-dimensional array support

In [15]:
# Creating DataFrames from scratch
data = {
    'file_name': ['s3-meta', 's3-meta', 's3-meta', 's3-meta', 's3-meta'], 
    'file size': [350, 337, 328, 311, 309],
    'load_timestamp': [20190710092917, 20190709093927, 20190708094847, 20190707091933, 20190706095917]
}


In [16]:
#DataFrame constructor:
s3_metadata = pd.DataFrame(data)

In [17]:

#check structure of the data:
s3_metadata.shape #(number rows, number columns)

(5, 3)

In [18]:
#Explore the first 5 rows of the data:
s3_metadata.head(5) 

Unnamed: 0,file_name,file size,load_timestamp
0,s3-meta,350,20190710092917
1,s3-meta,337,20190709093927
2,s3-meta,328,20190708094847
3,s3-meta,311,20190707091933
4,s3-meta,309,20190706095917


In [19]:
#Explore the last 5 rows (this DataFrame has only 5 rows, so the output matches head):
s3_metadata.tail(5)

Unnamed: 0,file_name,file size,load_timestamp
0,s3-meta,350,20190710092917
1,s3-meta,337,20190709093927
2,s3-meta,328,20190708094847
3,s3-meta,311,20190707091933
4,s3-meta,309,20190706095917


In [26]:
#Reading data from CSVs:

df = pd.read_csv("/Users/constantine/Applications/Projects/Python/machine-learning/data/s3-file-metadata.csv") #load the S3 file metadata
df.head(5)

Unnamed: 0,file_name,file_size,load_timestamp
0,s3-meta,380,20190710092917
1,s3_meta,358,20190709094145
2,s3_meta,364,20190708091748
3,s3_meta,331,20190707096135
4,s3_meta,329,20190706090134


In [27]:
#Getting info about your data

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
file_name         10 non-null object
file_size         10 non-null int64
load_timestamp    10 non-null int64
dtypes: int64(2), object(1)
memory usage: 320.0+ bytes


In [28]:
#Print the column names of our dataset:

df.columns


Index(['file_name', 'file_size', 'load_timestamp'], dtype='object')

In [30]:
#Using describe() on an entire DataFrame gives summary statistics (count, mean, std, min, quartiles, max) for each numeric column:

df.describe()

Unnamed: 0,file_size,load_timestamp
count,10.0,10.0
mean,344.0,20190710000000.0
std,23.781412,3028018.0
min,319.0,20190700000000.0
25%,323.0,20190700000000.0
50%,337.0,20190710000000.0
75%,362.5,20190710000000.0
max,380.0,20190710000000.0
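
The summary above treats `load_timestamp` as a plain integer, so its mean and standard deviation are not meaningful as dates. One option, assuming the integers encode `YYYYMMDDHHMMSS` (an assumption based on how the values look), is to parse them into datetimes before summarizing. The sketch below reuses the values from the hand-built DataFrame earlier in this guide; the `loaded_at` column name is made up for illustration, and `errors="coerce"` is an optional choice that turns unparseable values into `NaT` instead of raising.

```python
import pandas as pd

# Values copied from the hand-built DataFrame earlier in this guide
frame = pd.DataFrame({
    "file_size": [350, 337, 328, 311, 309],
    "load_timestamp": [20190710092917, 20190709093927, 20190708094847,
                       20190707091933, 20190706095917],
})

# Assumption: the integers follow the pattern YYYYMMDDHHMMSS
frame["loaded_at"] = pd.to_datetime(
    frame["load_timestamp"].astype(str),
    format="%Y%m%d%H%M%S",
    errors="coerce",  # invalid values become NaT instead of raising
)

# Earliest and latest load times as real timestamps
print(frame["loaded_at"].min(), frame["loaded_at"].max())
```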


In [31]:
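#Using the correlation method .corr() we can compute the pairwise correlation between the numeric columns.
#Selecting the numeric columns first keeps the call working on pandas versions where .corr() no longer skips text columns:

df[['file_size', 'load_timestamp']].corr()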

Unnamed: 0,file_size,load_timestamp
file_size,1.0,0.736095
load_timestamp,0.736095,1.0


In [35]:
#Extract a column using square brackets like this:

file_size = df['file_size']
print(file_size)

0    380
1    358
2    364
3    331
4    329
5    376
6    343
7    321
8    319
9    319
Name: file_size, dtype: int64
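
Single square brackets return a Series, as shown above. Double square brackets return a one-column DataFrame instead, which matters when later code expects a DataFrame. A minimal sketch, using the first three rows of the data as a stand-in:

```python
import pandas as pd

# First three rows of the data, rebuilt by hand for illustration
frame = pd.DataFrame({
    "file_name": ["s3-meta", "s3_meta", "s3_meta"],
    "file_size": [380, 358, 364],
})

as_series = frame["file_size"]    # single brackets -> Series
as_frame = frame[["file_size"]]   # double brackets -> one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```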


In [40]:
#Getting data by rows.
#For rows, we have two options:

#.loc  - locates by index label
#.iloc - locates by integer position

size = df.loc[3]
print(size)

file_name                s3_meta
file_size                    331
load_timestamp    20190707096135
Name: 3, dtype: object
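
With the default `RangeIndex`, `.loc[3]` and `.iloc[3]` return the same row, which hides the difference between them. The sketch below rebuilds the first five rows by hand and sorts them so that labels and positions no longer line up; the variable names are made up for illustration.

```python
import pandas as pd

# First five rows of the data, rebuilt by hand for illustration
frame = pd.DataFrame({
    "file_name": ["s3-meta", "s3_meta", "s3_meta", "s3_meta", "s3_meta"],
    "file_size": [380, 358, 364, 331, 329],
})

# Sorting reorders the rows but keeps their original index labels
by_size = frame.sort_values("file_size")

by_label = by_size.loc[3]      # the row whose index label is 3
by_position = by_size.iloc[3]  # the fourth row in the current order

print(by_label["file_size"])
print(by_position["file_size"])
```

After sorting, `.loc[3]` still returns the row originally labeled 3, while `.iloc[3]` returns whichever row now sits in the fourth position.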


In [43]:
#Let's filter the DataFrame to show only rows where file_size equals 319:

df[(df['file_size'] == 319)].head()

Unnamed: 0,file_name,file_size,load_timestamp
8,s3_meta,319,20190702093120
9,s3_meta,319,20190701092831
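
The same boolean-mask approach extends to several conditions. Each condition needs its own parentheses, and conditions are combined with `&` (and) or `|` (or) rather than Python's `and`/`or`. A minimal sketch using the `file_size` values printed earlier; the size thresholds are arbitrary choices for illustration.

```python
import pandas as pd

# file_size values as printed earlier in this guide
frame = pd.DataFrame({
    "file_size": [380, 358, 364, 331, 329, 376, 343, 321, 319, 319],
})

# Rows where file_size is between 320 and 340 (both conditions must hold)
mid_sized = frame[(frame["file_size"] >= 320) & (frame["file_size"] <= 340)]

# Rows where file_size is either the smallest or the largest value
extremes = frame[(frame["file_size"] == 319) | (frame["file_size"] == 380)]

print(mid_sized)
print(extremes)
```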


In [47]:
# Sort the DataFrame by load_timestamp, oldest first (inplace=True modifies df directly)
df.sort_values("load_timestamp", ascending=True, inplace=True)
df.head()

Unnamed: 0,file_name,file_size,load_timestamp
9,s3_meta,319,20190701092831
8,s3_meta,319,20190702093120
7,s3_meta,321,20190703091121
6,s3_meta,343,20190704091956
5,s3_meta,376,20190705093748
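
Note that the index labels in the output above (9, 8, 7, ...) travel with their rows. An alternative sketch, assuming you would rather keep the original frame untouched and get a fresh 0-based index, is to drop `inplace` and chain `reset_index(drop=True)`. The frame below rebuilds rows 5 to 9 from the output above.

```python
import pandas as pd

# Rows 5 to 9 of the data, rebuilt by hand with their original index labels
frame = pd.DataFrame(
    {
        "file_size": [376, 343, 321, 319, 319],
        "load_timestamp": [20190705093748, 20190704091956, 20190703091121,
                           20190702093120, 20190701092831],
    },
    index=[5, 6, 7, 8, 9],
)

# Without inplace, sort_values returns a new sorted frame;
# reset_index(drop=True) replaces the old labels with 0..n-1
oldest_first = frame.sort_values("load_timestamp").reset_index(drop=True)

print(oldest_first)
print(frame.index.tolist())  # the original frame keeps its order and labels
```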
