# DataFrames

### DataFrames are Pandas "tables" made up of columns and rows
* Each column of data in a DataFrame is a Pandas Series, and all columns share the same row index
* The column headers work as a column index that contains the Series names
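
To make the two points above concrete, here is a minimal sketch. The two-row `df` is a hypothetical frame invented for illustration; it is not one of the course datasets.

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration
df = pd.DataFrame({"id": [1, 2], "family": ["POULTRY", "PRODUCE"]})

family = df["family"]                   # selecting one column returns a Series
print(type(family).__name__)            # the column is a Series
print(family.name)                      # the Series name is the column header
print(family.index.equals(df.index))    # the Series shares the DataFrame's row index
print(list(df.columns))                 # the column index holds the Series names
```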

### You can create a DataFrame from a Python dictionary or NumPy array by using the Pandas DataFrame() function

In [1]:
import numpy as np
import pandas as pd

In [2]:
'''
It is unusual to build a DataFrame from a dictionary written by hand in Python.
Typically you will load a CSV file or an Excel file instead.
This example just shows that DataFrames can be created in several ways.
'''

pd.DataFrame(
    {"id": [1, 2],
     "store_nbr": [1, 2],
     "family": ["POULTRY", "PRODUCE"]
    }
)

Unnamed: 0,id,store_nbr,family
0,1,1,POULTRY
1,2,2,PRODUCE
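
The heading above also mentions NumPy arrays, so here is a minimal sketch of that route. The array values and the name `df_from_array` are made up for illustration. Because an array carries no column names, they are passed through the `columns` argument.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 array: each inner list becomes one row
arr = np.array([[1, 1],
                [2, 2]])

# Without the columns argument, the headers would default to 0, 1, ...
df_from_array = pd.DataFrame(arr, columns=["id", "store_nbr"])
print(df_from_array)
```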


## Create a DataFrame

In [3]:
oil = pd.read_csv(r'C:\Users\amorg\Documents\Maven NumPy and Pandas\maven_python_course\DataFrames\oil.csv')
oil

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.20
...,...,...
1213,2017-08-25,47.65
1214,2017-08-28,46.40
1215,2017-08-29,46.46
1216,2017-08-30,45.96


In [4]:
oil.shape # show the shape. 1218 rows and 2 columns

(1218, 2)

In [5]:
oil.index # shows the index

RangeIndex(start=0, stop=1218, step=1)

In [6]:
oil.columns # shows the column names

Index(['date', 'dcoilwtico'], dtype='object')

In [7]:
oil.axes # displays the two axes: the row index and the columns

[RangeIndex(start=0, stop=1218, step=1),
 Index(['date', 'dcoilwtico'], dtype='object')]

In [8]:
oil.dtypes # datatypes of columns

date           object
dcoilwtico    float64
dtype: object
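
The attributes above are related to each other, which the sketch below tries to show. It assumes a hypothetical three-row stand-in called `small_oil` so that it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the oil data
small_oil = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-03"],
    "dcoilwtico": [np.nan, 93.14, 92.97],
})

n_rows, n_cols = small_oil.shape        # shape is (number of rows, number of columns)
print(n_rows == len(small_oil.index))   # rows are counted along the index
print(n_cols == len(small_oil.columns)) # columns are counted along the column index

row_axis, col_axis = small_oil.axes     # axes is a list: [row index, column index]
print(row_axis.equals(small_oil.index), col_axis.equals(small_oil.columns))

print(small_oil.dtypes)                 # one dtype per column
```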

## Exploring a DataFrame

* You can explore a DataFrame using these methods.

In [9]:
retail = pd.read_csv(r'C:\Users\amorg\Documents\Maven NumPy and Pandas\maven_python_course\DataFrames\retail_2016_2017.csv')
retail

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.000,0
1,1945945,2016-01-01,1,BABY CARE,0.000,0
2,1945946,2016-01-01,1,BEAUTY,0.000,0
3,1945947,2016-01-01,1,BEVERAGES,0.000,0
4,1945948,2016-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...,...
1054939,3000883,2017-08-15,9,POULTRY,438.133,0
1054940,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
1054941,3000885,2017-08-15,9,PRODUCE,2419.729,148
1054942,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


### .head() and .tail() methods return the top or bottom rows in a DataFrame
* This is a great way to QA data upon import!

In [10]:
retail.head() # .head(nrows) returns the first n rows of the DataFrame (5 by default)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [11]:
retail.tail() # .tail(nrows) Returns the last n rows of the DataFrame (5 by default)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
1054939,3000883,2017-08-15,9,POULTRY,438.133,0
1054940,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
1054941,3000885,2017-08-15,9,PRODUCE,2419.729,148
1054942,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.0,8
1054943,3000887,2017-08-15,9,SEAFOOD,16.0,0
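
Both methods accept a row count. The sketch below uses a hypothetical ten-row frame named `demo` to show the default and an explicit `n`.

```python
import pandas as pd

# Hypothetical ten-row frame, used only for illustration
demo = pd.DataFrame({"id": range(10)})

default_head = demo.head()   # no argument: first 5 rows
first_three = demo.head(3)   # first 3 rows
last_two = demo.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```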


### The .sample() method returns a random sample of rows from a DataFrame

In [12]:
retail.sample(5) # .sample(n, random_state=12345) returns n rows chosen at random (1 by default).
              # Pass a random_state argument to reproduce the identical sample in another body of work or to keep this one consistent

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
408481,2354425,2016-08-17,20,CLEANING,948.0,16
990287,2936231,2017-07-10,44,MAGAZINES,26.0,0
571,1946515,2016-01-01,25,EGGS,230.0,24
846340,2792284,2017-04-20,6,"LIQUOR,WINE,BEER",46.0,2
191413,2137357,2016-04-17,3,GROCERY II,83.0,0
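
A small sketch of the reproducibility point, assuming a hypothetical 100-row frame named `demo`. Two calls with the same `random_state` should return the same rows.

```python
import pandas as pd

# Hypothetical 100-row frame, used only for illustration
demo = pd.DataFrame({"id": range(100)})

one_row = demo.sample()                        # no argument: a single random row
sample_a = demo.sample(5, random_state=12345)  # seeded sample of 5 rows
sample_b = demo.sample(5, random_state=12345)  # same seed, so the same rows

print(sample_a.equals(sample_b))
```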


### The .info() method returns details on a DataFrame's properties and memory usage

In [13]:
retail.info() # .info() Returns key details on DataFrame size, columns, and memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   id           1054944 non-null  int64  
 1   date         1054944 non-null  object 
 2   store_nbr    1054944 non-null  int64  
 3   family       1054944 non-null  object 
 4   sales        1054944 non-null  float64
 5   onpromotion  1054944 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 48.3+ MB


The .info() method will show non-null counts on a DataFrame with fewer than ~1.7 million rows,
but you can specify show_counts=True to ensure they are always displayed.

This is a great way to quickly identify missing values: if the non-null count is less than the total number of rows,
then the difference is the number of NaN values in that column!
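
The arithmetic described above can be sketched as follows. The frame `demo` is a hypothetical example with one missing value. `.count()` gives the non-null count per column, and `.isna().sum()` is shown as a cross-check.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing price, used only for illustration
demo = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-03"],
    "price": [np.nan, 93.14, 92.97],
})

demo.info(show_counts=True)          # always print the non-null counts

missing = len(demo) - demo.count()   # total rows minus non-null count, per column
print(missing)
print(demo.isna().sum())             # the same numbers, computed directly
```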

### The .describe() method returns key statistics on a DataFrame's columns


In [14]:
retail.describe() # .describe() returns descriptive statistics for the columns in a DataFrame
               # (only numeric columns by default; use the include argument, e.g. include="all", to cover more columns)
               # you can also chain .round() to make the numbers more readable (rounded values usually display without scientific notation)

Unnamed: 0,id,store_nbr,sales,onpromotion
count,1054944.0,1054944.0,1054944.0,1054944.0
mean,2473416.0,27.5,457.7225,5.937977
std,304536.2,15.58579,1317.155,18.08632
min,1945944.0,1.0,0.0,0.0
25%,2209680.0,14.0,2.0,0.0
50%,2473416.0,27.5,24.0,0.0
75%,2737151.0,41.0,262.0,3.0
max,3000887.0,54.0,124717.0,741.0
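
As a rough check on what these statistics mean, the sketch below runs `.describe()` on a hypothetical four-row frame named `demo`. It illustrates two points: non-numeric columns are left out by default, and `count` only counts non-null values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and one missing sale
demo = pd.DataFrame({
    "family": ["POULTRY", "PRODUCE", "POULTRY", "SEAFOOD"],
    "sales": [10.0, 20.0, np.nan, 30.0],
})

stats = demo.describe()                        # numeric columns only by default
print(list(stats.columns))                     # the text column is not included
print(stats.loc["count", "sales"])             # NaN is excluded from the count
print(stats.loc["mean", "sales"] == demo["sales"].mean())
```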


In [15]:
retail.describe(include="all").round() # chain .round() to make the numbers more readable (rounded values usually display without scientific notation)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
count,1054944.0,1054944,1054944.0,1054944,1054944.0,1054944.0
unique,,592,,33,,
top,,2016-01-01,,AUTOMOTIVE,,
freq,,1782,,31968,,
mean,2473416.0,,28.0,,458.0,6.0
std,304536.0,,16.0,,1317.0,18.0
min,1945944.0,,1.0,,0.0,0.0
25%,2209680.0,,14.0,,2.0,0.0
50%,2473416.0,,28.0,,24.0,0.0
75%,2737151.0,,41.0,,262.0,3.0
