In [1]:
import pandas as pd
import numpy as np

# - Creating a Pandas Series: first main data structure in Pandas

In [2]:
groceries = pd.Series(data=[30, 6, 'Yes', 'No'], index=['eggs', 'apples', 'milk', 'bread'])

> A pandas series is a 1D array-like object that can hold **MANY data types**, such as numbers and strings.

> This is different from a NumPy array, whose elements must ALL share ONE data type

> Also, you can assign an **index label** to each element in the Pandas series (not the case for NumPy's arrays where there is only 0-indexing). In other words, you can name the indices of your Pandas Series anything you want.

In [3]:
groceries

eggs       30
apples      6
milk      Yes
bread      No
dtype: object
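
The `dtype: object` line above is what allows the series to mix integers and strings. As a minimal sketch of the contrast with NumPy, assuming NumPy's default conversion rules (the names `mixed`, `as_array` and `as_series` are illustrative only):

```python
import numpy as np
import pandas as pd

mixed = [30, 6, 'Yes', 'No']

# NumPy picks ONE dtype for the whole array, so the integers become strings
as_array = np.array(mixed)

# A Series with dtype 'object' keeps each element as its own Python object
as_series = pd.Series(data=mixed, index=['eggs', 'apples', 'milk', 'bread'])

element_types = [type(value).__name__ for value in as_series]
print(as_array.dtype.kind)   # 'U' means a unicode string dtype
print(as_series.dtype)
print(element_types)
```

The price of this flexibility is that `object` series are generally slower than numeric ones, which is why the numeric `fruits` series later in this notebook reports `int64` instead.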

- **Pandas Series:**
> indices in the first column and the data in the second column

**The data is NOT indexed 0 to 3, BUT with the name of the foods i.e. whatever we passed in the argument 'index' of the Pandas Series 'groceries' above.**

# - Attributes to get info from Pandas Series

.shape: gives the size of each dimension of the data

In [4]:
groceries.shape

(4,)

.ndim: gives the number of dimensions of the data

In [5]:
groceries.ndim

1

.size: gives the total number of values in the series

In [6]:
groceries.size

4

.index: gives the index labels of the series object

In [7]:
groceries.index

Index(['eggs', 'apples', 'milk', 'bread'], dtype='object')

.values: gives the data in the series object

In [8]:
groceries.values

array([30, 6, 'Yes', 'No'], dtype=object)

Check if a label is one of the index labels of the series object:

In [9]:
'banana' in groceries

False

In [10]:
'bread' in groceries

True

> Therefore, `'banana'` is NOT one of the index labels in `groceries` series, but `'bread'` is.

# - Accessing elements in Pandas Series

In [11]:
groceries['eggs']  # to access the quantity of 'eggs' index label

30

In [12]:
groceries[ ['milk', 'bread'] ]   # providing a list of index labels

milk     Yes
bread     No
dtype: object

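## We can also use *numeric* indices, just like in lists and NumPy ndarrays:

> Note: in recent pandas versions, using `[]` with integer positions on a series that has non-integer labels is deprecated; `.iloc` (shown in the next section) is the safer choice.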

In [13]:
groceries[0]  # gets the value of the first element in the series object 'groceries'

30

In [14]:
groceries[-1]

'No'

In [15]:
groceries[ [0, 1] ]   # providing a list of NUMERICAL indices

eggs      30
apples     6
dtype: object

# - Labelled index or Numerical index?

In [16]:
groceries.loc[ ['eggs', 'bread'] ]   # explicitly states that we're using LABELLED index (loc = location)

eggs     30
bread    No
dtype: object

In [17]:
groceries.iloc[ [0, 2] ]   # explicitly states that we're using NUMERICAL index (iloc = integer location)

eggs     30
milk    Yes
dtype: object
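
The distinction matters most when the index labels are themselves integers, because a label and a position can then disagree. A small sketch, using a hypothetical series `shelf` that is not part of the notebook:

```python
import pandas as pd

# Hypothetical series whose index labels are themselves integers
shelf = pd.Series(data=['top', 'middle', 'bottom'], index=[2, 1, 0])

by_label = shelf.loc[0]      # the element whose LABEL is 0
by_position = shelf.iloc[0]  # the element at POSITION 0 (the first one)

print(by_label, by_position)
```

Here `.loc[0]` and `.iloc[0]` return different elements, so spelling out which one is meant avoids silent mistakes.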

# - Pandas Series are mutable like NumPy ndarrays.
> We can change the elements of a series AFTER it's been created.

In [18]:
groceries

eggs       30
apples      6
milk      Yes
bread      No
dtype: object

In [19]:
groceries['eggs']  = 2  # changing the value of the labelled index 'eggs' in the series object 'groceries'

In [20]:
groceries

eggs        2
apples      6
milk      Yes
bread      No
dtype: object

# Deleting elements from a Pandas Series
> Using the drop method

In [21]:
groceries

eggs        2
apples      6
milk      Yes
bread      No
dtype: object

In [22]:
groceries.drop('apples')   # deletes 'apples'

eggs       2
milk     Yes
bread     No
dtype: object

## NOTE: drop() works OUT of place (it returns a new series and leaves the original untouched), that is:

In [23]:
groceries

eggs        2
apples      6
milk      Yes
bread      No
dtype: object

> `'apples'` is still there.

> `.drop()` just returned a modified copy of the series BUT didn't actually change the original series.

## Deleting elements of the series IN place (modifying the original series):

In [24]:
groceries.drop('apples', inplace = True)

In [25]:
groceries

eggs       2
milk     Yes
bread     No
dtype: object

> Now, `'apples'` has really been deleted from the original series object.
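
As an alternative to `inplace = True`, the result of `drop()` can simply be assigned back to the same name. A minimal sketch, rebuilding the `groceries` series so the example runs on its own (`without_apples` and `still_there` are illustrative names):

```python
import pandas as pd

groceries = pd.Series(data=[2, 6, 'Yes', 'No'], index=['eggs', 'apples', 'milk', 'bread'])

# drop() returns a NEW series; the original keeps all four labels
without_apples = groceries.drop('apples')
still_there = 'apples' in groceries

# Reassigning the name is an alternative to inplace = True
groceries = groceries.drop(['milk', 'bread'])

print(without_apples.index.tolist())
print(groceries.index.tolist())
```

Passing a list of labels, as in the last call, drops several elements at once.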

# - Arithmetic Operations on Pandas Series

Like for NumPy ndarrays, we can perform **element-wise** arithmetic operations on Pandas series.

> **here, we'll look at arithmetic operations between Pandas series and single numbers.**

In [26]:
fruits = pd.Series(data=[10, 6, 3], index=['apples', 'oranges', 'bananas'])

In [27]:
fruits

apples     10
oranges     6
bananas     3
dtype: int64

In [28]:
fruits + 2  # adding 2 to EACH element in series object 'fruits' 

apples     12
oranges     8
bananas     5
dtype: int64

In [29]:
fruits - 2  # subtracting 2 from EACH element in series object 'fruits' 

apples     8
oranges    4
bananas    1
dtype: int64

In [30]:
fruits * 2  # multiply by 2 EACH element in series object 'fruits'  

apples     20
oranges    12
bananas     6
dtype: int64

In [31]:
fruits / 2  # divide by 2 EACH element in series object 'fruits'  

apples     5.0
oranges    3.0
bananas    1.5
dtype: float64

## Apply mathematical functions from NumPy:

In [32]:
fruits

apples     10
oranges     6
bananas     3
dtype: int64

In [33]:
np.sqrt(fruits)  # getting the square root of EACH element in series object 'fruits'     

apples     3.162278
oranges    2.449490
bananas    1.732051
dtype: float64

In [34]:
np.exp(fruits)   # getting the exponential of EACH element in series object 'fruits'    

apples     22026.465795
oranges      403.428793
bananas       20.085537
dtype: float64

In [35]:
np.power(fruits, 2)  # squaring (raising to the power of 2) EACH element in series object 'fruits'

apples     100
oranges     36
bananas      9
dtype: int64

## Apply arithmetic operations on SELECTED items in a series

In [36]:
fruits

apples     10
oranges     6
bananas     3
dtype: int64

In [37]:
fruits['bananas'] + 2   # adds 2 JUST to the bananas item

5

In [38]:
fruits    # unchanged: to actually modify the bananas value, assign it back: fruits['bananas'] = fruits['bananas'] + 2

apples     10
oranges     6
bananas     3
dtype: int64

In [39]:
fruits.iloc[0] - 2  # also works with numerical index

8

In [40]:
fruits[ ['apples', 'oranges'] ] * 2   

apples     20
oranges    12
dtype: int64
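
The expressions above only compute new values; none of them changes `fruits`. One possible way to write the results back, sketched here with augmented assignment through `.loc` (the series is rebuilt so the example runs on its own):

```python
import pandas as pd

fruits = pd.Series(data=[10, 6, 3], index=['apples', 'oranges', 'bananas'])

# Augmented assignment through .loc writes the result back into the series
fruits.loc['bananas'] += 2
fruits.loc[['apples', 'oranges']] *= 2

print(fruits.to_dict())
```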

# - Creating a Pandas DataFrame: second main data structure in Pandas

> a 2-dimensional object with labeled rows and columns.

> It can hold multiple data types, like Pandas Series.

# - Creating DataFrames manually

## 1/ Creating a DataFrame manually from a dictionary containing several Pandas Series:

In [41]:
# step 1: create a python dictionary

# contains the shopping carts of two people
items = {'Bob' : pd.Series(data=[245, 25, 55], index=['bike', 'pants', 'watch']), 
         'Alice' : pd.Series(data=[40, 110, 500, 45], index=['book', 'glasses', 'bike', 'pants']) }

# each series contains the price of the items and each series is labeled with the items' names.

In [42]:
type(items)

dict

In [43]:
# step 2: passing the dictionary 'items' to the DataFrame function, to create the Pandas DataFrame

shopping_carts = pd.DataFrame(items)

In [44]:
shopping_carts

Unnamed: 0,Bob,Alice
bike,245.0,500.0
book,,40.0
glasses,,110.0
pants,25.0,45.0
watch,55.0,


### Observations:

> DataFrames are displayed in a tabular form, much like a spreadsheet (e.g. in Excel), with the labels of the rows and columns in bold.

> Row labels of the DataFrame: built from the **union of the index labels we provided in the series (values of the 'items' dictionary)**

> Column labels of the DataFrame: taken from the **keys of the dictionary passed as argument to the DataFrame function**

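> Notice the **NaN values** that appeared in the DataFrame: they mark items missing from one person's cart (e.g. Bob has no 'book'), and because NaN is a float they turn both columns into floats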
>> NaN stands for 'Not a Number'
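
These observations can be checked directly. A short sketch that rebuilds the same `items` dictionary and compares the row labels with the union of the two series' labels (`union_of_labels` and `missing_per_person` are illustrative names):

```python
import pandas as pd

items = {'Bob': pd.Series(data=[245, 25, 55], index=['bike', 'pants', 'watch']),
         'Alice': pd.Series(data=[40, 110, 500, 45], index=['book', 'glasses', 'bike', 'pants'])}
shopping_carts = pd.DataFrame(items)

# Row labels: every label that appears in at least one of the two series
union_of_labels = set(items['Bob'].index) | set(items['Alice'].index)

# Cells without a matching label in a series are filled with NaN
missing_per_person = shopping_carts.isna().sum()

print(sorted(union_of_labels))
print(missing_per_person.to_dict())
```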

## Extract info from a DataFrame using attributes:

.index: gets the index labels from our DataFrame  (the labels of the rows of the DataFrame)

In [45]:
shopping_carts.index

Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

.columns: gets the column labels from our DataFrame

In [46]:
shopping_carts.columns

Index(['Bob', 'Alice'], dtype='object')

.values: gets the data from our DataFrame

In [47]:
shopping_carts.values

array([[245., 500.],
       [ nan,  40.],
       [ nan, 110.],
       [ 25.,  45.],
       [ 55.,  nan]])

In [48]:
shopping_carts.shape

(5, 2)

In [49]:
shopping_carts.size

10

In [50]:
shopping_carts.ndim

2

## Passing only a *subset*  of a dictionary to the DataFrame function:

In [51]:
items   # recall this dictionary

{'Bob': bike     245
 pants     25
 watch     55
 dtype: int64,
 'Alice': book        40
 glasses    110
 bike       500
 pants       45
 dtype: int64}

In [52]:
# Creating a new DataFrame from a subset of 'items' dictionary:

bob_shopping_cart = pd.DataFrame(items, columns = ['Bob'])

bob_shopping_cart

Unnamed: 0,Bob
bike,245
pants,25
watch,55


In [53]:
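# Creating a new DataFrame from selected ROWS of the 'items' dictionary:
# (note: 'books' is not a label in either series, the label is 'book', so that row contains only NaN)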

selected_shopping_cart = pd.DataFrame(items, index = ['pants', 'books'])

selected_shopping_cart

Unnamed: 0,Bob,Alice
pants,25.0,45.0
books,,


In [54]:
# This DataFrame has only SELECTED items from ALICE's shopping cart:

alice_selected_shopping_cart = pd.DataFrame(items, columns = ['Alice'], index = ['glasses', 'bike'])

alice_selected_shopping_cart

Unnamed: 0,Alice
glasses,110
bike,500


## 2/ Creating a DataFrame manually from a dictionary of lists or arrays:

Constraint: all the lists/arrays in the dictionary (its values) MUST be of **same length**
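
A quick way to see this constraint in action is to break it on purpose. In the sketch below, the dictionary `uneven` is a hypothetical example that is not part of the notebook:

```python
import pandas as pd

# Hypothetical dictionary whose lists have DIFFERENT lengths (3 vs 2)
uneven = {'integers': [1, 2, 3], 'floats': [4.5, 8.2]}

try:
    pd.DataFrame(uneven)
    error_message = None
except ValueError as error:
    # pandas refuses to build the DataFrame and explains why
    error_message = str(error)

print(error_message)
```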

In [55]:
# Creating a dictionary:
data = {'integers' : [1, 2, 3], 'floats' : [4.5, 8.2, 9.6]}

# Manually creating a Pandas DataFrame 'df':
df = pd.DataFrame(data)

df

Unnamed: 0,integers,floats
0,1,4.5
1,2,8.2
2,3,9.6


> Since the 'data' dictionary doesn't have index labels, Pandas automatically uses **numerical row indices** when it creates the 'df' DataFrame.

### But, we can add the index labels to the 'df' DataFrame:

In [56]:
df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])

df

Unnamed: 0,integers,floats
label 1,1,4.5
label 2,2,8.2
label 3,3,9.6


## 3/ Creating a DataFrame manually from a list of Python dictionaries:

In [57]:
# Creates a list of dictionaries

items = [ {'bikes':20, 'pants':30, 'watches':35}, {'watches':10, 'glasses':50, 'bikes':15, 'pants':5} ]

In [58]:
# Creates a DataFrame from 'items':

store_items = pd.DataFrame(items, index = ['store 1', 'store 2'])

store_items

Unnamed: 0,bikes,pants,watches,glasses
store 1,20,30,35,
store 2,15,5,10,50.0


# - Accessing Elements in Pandas DataFrames

In [59]:
store_items

Unnamed: 0,bikes,pants,watches,glasses
store 1,20,30,35,
store 2,15,5,10,50.0


In [60]:
store_items[ ['bikes'] ]  # accessing the 'bikes' column

Unnamed: 0,bikes
store 1,20
store 2,15


In [61]:
store_items[ ['bikes', 'pants'] ]  # accessing multiple columns

Unnamed: 0,bikes,pants
store 1,20,30
store 2,15,5


In [62]:
store_items.loc[ ['store 1'] ]   # accessing the 'store 1' row

Unnamed: 0,bikes,pants,watches,glasses
store 1,20,30,35,


In [63]:
store_items['bikes']['store 1']   # accessing the value at a specific column and row
# with chained [] indexing, the column label comes first, then the row label

20
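
The same value can also be reached in a single step with `.loc` or `.at`, where the order is reversed (row first, then column). A minimal sketch that rebuilds `store_items` so it runs on its own:

```python
import pandas as pd

items = [{'bikes': 20, 'pants': 30, 'watches': 35},
         {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}]
store_items = pd.DataFrame(items, index=['store 1', 'store 2'])

chained = store_items['bikes']['store 1']        # column first, then row
with_loc = store_items.loc['store 1', 'bikes']   # row first, then column
with_at = store_items.at['store 1', 'bikes']     # same order as .loc, single value only

print(chained, with_loc, with_at)
```

The single-step forms are usually preferable when assigning a value, since chained indexing may end up modifying a temporary copy.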

## Modifying the DataFrame by adding rows and columns:

**- Adding columns to our DataFrame:**

In [64]:
# Adding a 'shirts' column to 'store_items' DataFrame:

store_items['shirts'] = [15, 2]

store_items

Unnamed: 0,bikes,pants,watches,glasses,shirts
store 1,20,30,35,,15
store 2,15,5,10,50.0,2


In [65]:
# Adding a 'suits' column to 'store_items' DataFrame using arithmetic operations on existing columns:

store_items['suits'] = store_items['pants'] + store_items['shirts']

store_items

Unnamed: 0,bikes,pants,watches,glasses,shirts,suits
store 1,20,30,35,,15,45
store 2,15,5,10,50.0,2,7


**- Adding rows to our DataFrame:**

> **First step**: create a NEW DataFrame with those rows

> **Second step**: append the newly created DataFrame to the original DataFrame

In [66]:
# First step

new_items = [ {'bikes':20, 'pants':30, 'watches':35, 'glasses':4} ]

new_store = pd.DataFrame(new_items, index = ['store 3'])

new_store

Unnamed: 0,bikes,pants,watches,glasses
store 3,20,30,35,4


In [67]:
# Second step
# (DataFrame.append() was removed in pandas 2.0; pd.concat() is its replacement)

store_items = pd.concat([store_items, new_store])

store_items  # appended the 'store 3' row to 'store_items' DataFrame

Unnamed: 0,bikes,pants,watches,glasses,shirts,suits
store 1,20,30,35,,15.0,45.0
store 2,15,5,10,50.0,2.0,7.0
store 3,20,30,35,4.0,,


**- Adding new columns to our DataFrame by selecting data that ALREADY exists in our DataFrame:**

In [68]:
# New column: 'new_watches', only for store 2 and store 3:

store_items['new_watches'] = store_items['watches'][1:]   # specifying store 2 and store 3 by numerical indexing

store_items

Unnamed: 0,bikes,pants,watches,glasses,shirts,suits,new_watches
store 1,20,30,35,,15.0,45.0,
store 2,15,5,10,50.0,2.0,7.0,10.0
store 3,20,30,35,4.0,,,35.0


**- Adding new columns to our DataFrame ANYWHERE we want:**

In [69]:
store_items.insert(5, 'shoes', [8, 5, 0])

# args of insert() = location, label, data

In [70]:
# see store_items DataFrame
store_items

Unnamed: 0,bikes,pants,watches,glasses,shirts,shoes,suits,new_watches
store 1,20,30,35,,15.0,8,45.0,
store 2,15,5,10,50.0,2.0,5,7.0,10.0
store 3,20,30,35,4.0,,0,,35.0


## Modifying the DataFrame by deleting rows and columns:

**- Deleting columns from our DataFrame:**
> Using the pop method

> Using the drop method with the keyword: axis = 1

In [71]:
# Using the pop method

store_items.pop('new_watches')

store 1     NaN
store 2    10.0
store 3    35.0
Name: new_watches, dtype: float64

In [72]:
store_items

Unnamed: 0,bikes,pants,watches,glasses,shirts,shoes,suits
store 1,20,30,35,,15.0,8,45.0
store 2,15,5,10,50.0,2.0,5,7.0
store 3,20,30,35,4.0,,0,


In [73]:
# Using drop method with axis = 1

store_items = store_items.drop(['watches'], axis = 1)

store_items

Unnamed: 0,bikes,pants,glasses,shirts,shoes,suits
store 1,20,30,,15.0,8,45.0
store 2,15,5,50.0,2.0,5,7.0
store 3,20,30,4.0,,0,


**- Deleting rows from our DataFrame:**
> Using the drop method with the keyword: axis = 0

In [74]:
# Using drop method with axis = 0

store_items = store_items.drop(['store 2'], axis = 0)

store_items

Unnamed: 0,bikes,pants,glasses,shirts,shoes,suits
store 1,20,30,,15.0,8,45.0
store 3,20,30,4.0,,0,


## Changing the labels of the columns and rows of our DataFrame
> Using the rename method

In [75]:
store_items

Unnamed: 0,bikes,pants,glasses,shirts,shoes,suits
store 1,20,30,,15.0,8,45.0
store 3,20,30,4.0,,0,


In [76]:
# Changing the label of the COLUMN 'bikes' to 'hats'

store_items = store_items.rename(columns = {'bikes':'hats'})   # original label as key and new label as value of the dict

store_items

Unnamed: 0,hats,pants,glasses,shirts,shoes,suits
store 1,20,30,,15.0,8,45.0
store 3,20,30,4.0,,0,


In [77]:
# Changing the label of the ROW 'store 3' to 'last store'

store_items = store_items.rename(index = {'store 3':'last store'})

store_items

Unnamed: 0,hats,pants,glasses,shirts,shoes,suits
store 1,20,30,,15.0,8,45.0
last store,20,30,4.0,,0,


# - Dealing with NaN: to *clean* our data

> Pandas assigns the value NaN to **missing data.**

In [78]:
items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
         {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
         {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}]

store_items = pd.DataFrame(items, index = ['store 1', 'store 2', 'store 3'])
    
store_items

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,,10,,4.0


In [79]:
# To count the number of NaN values in our data
# using a combination of methods

NaN_number = store_items.isnull().sum().sum()

print(NaN_number)

3
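
The chained call can be unpacked step by step. A sketch that rebuilds the same `store_items` DataFrame (`nan_per_column`, `total_nan` and `non_nan` are illustrative names):

```python
import pandas as pd

items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
         {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
         {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}]
store_items = pd.DataFrame(items, index=['store 1', 'store 2', 'store 3'])

nan_per_column = store_items.isna().sum()   # first sum(): one count per column
total_nan = nan_per_column.sum()            # second sum(): adds up the column counts
non_nan = store_items.count().sum()         # count() ignores NaN values

print(nan_per_column.to_dict())
print(total_nan, non_nan)
```

`isna()` is an alias of `isnull()`, so either spelling gives the same result.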


## Once we know we have NaN in our data, we can either:
- remove them

OR

- replace them

### -  REMOVING NaN values:

In [80]:
# to REMOVE NaN values: dropna() method

store_items.dropna(axis = 0)  # to remove any ROWS with NaN values

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 2,15,5,10,2.0,5,7.0,50.0


In [81]:
store_items.dropna(axis = 1)  # to remove any COLUMNS with NaN values

Unnamed: 0,bikes,pants,watches,shoes
store 1,20,30,35,8
store 2,15,5,10,5
store 3,20,30,35,10


### Note:  dropna() drops these rows and columns OUT of place (doesn't modify the original DataFrame)

### BUT you can remove rows/columns with NaN values IN PLACE using keyword `inplace = True` within dropna()

### - REPLACING NaN values:

e.g. replace all NaN values with 0

In [82]:
# Using the fillna() method

store_items.fillna(0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,0.0
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,0.0,10,0.0,4.0


#### 1/ FORWARD FILLING:

e.g. replace each NaN value with the previous value along the specified axis

In [83]:
# Using the ffill() method (fillna(method = 'ffill') is deprecated in recent pandas versions)

store_items.ffill(axis = 0)  # NaN values replaced by the value in the previous row of their column

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,2.0,10,7.0,4.0


> Notice however: the NaN value in store 1 and glasses column did NOT get replaced, BECAUSE there were NO previous values in the glasses column, since store 1 is the first row of the DataFrame.

With axis = 1, that NaN IS filled, because the previous column in its row ('suits') holds a value; only a NaN in the FIRST column would be left unfilled:

In [84]:
store_items.ffill(axis = 1)  # NaN values replaced by the value in the previous column of their row

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20.0,30.0,35.0,15.0,8.0,45.0,45.0
store 2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
store 3,20.0,30.0,35.0,35.0,10.0,10.0,4.0


#### 2/ BACKWARD FILLING:

In [85]:
store_items.bfill(axis = 0)  # NaN values replaced by the value in the FOLLOWING row of their column

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,50.0
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,,10,,4.0


In [86]:
store_items.bfill(axis = 1)  # NaN values replaced by the value in the FOLLOWING column of their row

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20.0,30.0,35.0,15.0,8.0,45.0,
store 2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
store 3,20.0,30.0,35.0,10.0,10.0,4.0,4.0


### Note: .fillna() fills the NaN values OUT of place (doesn't modify the original DataFrame)

### Set the parameter: `inplace = True` so that it modifies the DataFrame instead of giving a copy of it
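
A single fill value is not always appropriate for every column. As a sketch of one alternative, `fillna()` also accepts a dictionary mapping column labels to fill values; the choice of 0 for 'glasses' and the column mean for 'shirts' is purely illustrative:

```python
import pandas as pd

items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
         {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
         {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}]
store_items = pd.DataFrame(items, index=['store 1', 'store 2', 'store 3'])

# One fill value per column; columns not listed (e.g. 'suits') keep their NaN
filled = store_items.fillna({'glasses': 0, 'shirts': store_items['shirts'].mean()})

print(filled[['glasses', 'shirts', 'suits']])
```

Because the call is out of place, `store_items` itself still contains its three NaN values afterwards.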

#### 3/ Replace NaN values by using different *interpolation methods*:

In [87]:
# Linear interpolation

## to replace NaN values using the values along the column axis:
store_items.interpolate(method = 'linear', axis = 0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,2.0,10,7.0,4.0


> The two NaN values in store 3 have been replaced. Since there is no later row to interpolate towards, pandas carries the last valid value of each column forward (2.0 and 7.0).

> NaN value in store 1 did NOT get replaced, since there is no data before it to allow the interpolation function to calculate a value

In [88]:
# Linear interpolation

## to replace NaN values using the values along the row axis:
store_items.interpolate(method = 'linear', axis = 1)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20.0,30.0,35.0,15.0,8.0,45.0,45.0
store 2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
store 3,20.0,30.0,35.0,22.5,10.0,7.0,4.0


### Note: .interpolate() fills the NaN values OUT of place (doesn't modify the original DataFrame)
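
The three cases seen in the outputs above (a gap between two values, a trailing gap, a leading gap) can be reproduced on small series. A sketch using the values from store 3's row and from the 'shirts' and 'glasses' columns (`between`, `trailing` and `leading` are illustrative names):

```python
import numpy as np
import pandas as pd

# Store 3's row, restricted to the columns around its two gaps
store_3 = pd.Series(data=[35.0, np.nan, 10.0, np.nan, 4.0],
                    index=['watches', 'shirts', 'shoes', 'suits', 'glasses'])
between = store_3.interpolate(method='linear')   # gaps become the midpoint of their neighbours

# 'shirts' column: the gap comes LAST, so the last valid value is repeated
trailing = pd.Series([15.0, 2.0, np.nan]).interpolate(method='linear')

# 'glasses' column: the gap comes FIRST, so there is nothing to fill it with
leading = pd.Series([np.nan, 50.0, 4.0]).interpolate(method='linear')

print(between.to_dict())
print(trailing.tolist(), leading.tolist())
```

For the gap between 35 and 10, the midpoint is (35 + 10) / 2 = 22.5; between 10 and 4 it is 7.0, matching the row-wise output above.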

# - Creating DataFrames: Loading Data into a pandas DataFrame

- Pandas allows us to load datasets stored in different formats into DataFrames.
> One of the most popular formats used to store data is **CSV**, or 'comma-separated values'.

**Loading CSV data into pandas DataFrame using read_csv():**

In [96]:
google_stock = pd.read_csv('Google_Stock_Price_Test.csv')  # file must be in the same directory

In [97]:
google_stock

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,1/3/2017,778.81,789.63,775.8,786.14,1657300
1,1/4/2017,788.36,791.34,783.16,786.9,1073000
2,1/5/2017,786.08,794.48,785.02,794.02,1335200
3,1/6/2017,795.26,807.9,792.2,806.15,1640200
4,1/9/2017,806.4,809.97,802.83,806.65,1272400
5,1/10/2017,807.86,809.13,803.51,804.79,1176800
6,1/11/2017,805.0,808.15,801.37,807.91,1065900
7,1/12/2017,807.14,807.39,799.17,806.36,1353100
8,1/13/2017,807.48,811.22,806.69,807.88,1099200
9,1/17/2017,807.08,807.14,800.37,804.61,1362100


In [98]:
print(type(google_stock))
print(google_stock.shape)

<class 'pandas.core.frame.DataFrame'>
(20, 6)


In [99]:
# take a look only at the FIRST 5 rows of the DataFrame:

google_stock.head()  # head method

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,1/3/2017,778.81,789.63,775.8,786.14,1657300
1,1/4/2017,788.36,791.34,783.16,786.9,1073000
2,1/5/2017,786.08,794.48,785.02,794.02,1335200
3,1/6/2017,795.26,807.9,792.2,806.15,1640200
4,1/9/2017,806.4,809.97,802.83,806.65,1272400


In [100]:
# take a look only at the LAST 5 rows of the DataFrame:

google_stock.tail()  # tail method

Unnamed: 0,Date,Open,High,Low,Close,Volume
15,1/25/2017,829.62,835.77,825.06,835.67,1494500
16,1/26/2017,837.81,838.0,827.01,832.15,2973900
17,1/27/2017,834.71,841.95,820.44,823.31,2965800
18,1/30/2017,814.66,815.84,799.8,802.32,3246600
19,1/31/2017,796.86,801.25,790.52,796.79,2160600


In [104]:
# choose the number of rows to return:
# e.g. :

google_stock.head(10) 

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,1/3/2017,778.81,789.63,775.8,786.14,1657300
1,1/4/2017,788.36,791.34,783.16,786.9,1073000
2,1/5/2017,786.08,794.48,785.02,794.02,1335200
3,1/6/2017,795.26,807.9,792.2,806.15,1640200
4,1/9/2017,806.4,809.97,802.83,806.65,1272400
5,1/10/2017,807.86,809.13,803.51,804.79,1176800
6,1/11/2017,805.0,808.15,801.37,807.91,1065900
7,1/12/2017,807.14,807.39,799.17,806.36,1353100
8,1/13/2017,807.48,811.22,806.69,807.88,1099200
9,1/17/2017,807.08,807.14,800.37,804.61,1362100


In [105]:
# Check if we have NaN values in the data set:

google_stock.isnull().any()    # .any() reports, for EACH column, whether it contains at least one NaN value

Date      False
Open      False
High      False
Low       False
Close     False
Volume    False
dtype: bool

> Observation: we have NO missing data.

## Get statistical information from the DataFrame:

In [107]:
# for DESCRIPTIVE STATISTICS on each NUMERIC column of the data set:

google_stock.describe()

Unnamed: 0,Open,High,Low,Close
count,20.0,20.0,20.0,20.0
mean,807.526,811.9265,801.9495,807.9045
std,15.125428,14.381198,13.278607,13.210088
min,778.81,789.63,775.8,786.14
25%,802.965,806.735,797.4275,802.2825
50%,806.995,808.64,801.53,806.11
75%,809.56,817.0975,804.4775,810.76
max,837.81,841.95,827.01,835.67


In [108]:
# apply .describe() method on a SINGLE column:

google_stock['High'].describe()

count     20.000000
mean     811.926500
std       14.381198
min      789.630000
25%      806.735000
50%      808.640000
75%      817.097500
max      841.950000
Name: High, dtype: float64

### - You can look at one statistic instead, using the STATISTICAL FUNCTIONS provided by Pandas:

In [111]:
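# maximum value in EACH column
# (Date and Volume appear to be stored as strings here, so their 'maximum' is the alphabetically last value)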

google_stock.max()

Date      1/9/2017
Open        837.81
High        841.95
Low         827.01
Close       835.67
Volume     919,300
dtype: object
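
The `'1/9/2017'` and `'919,300'` results above look like string comparisons rather than chronological or numerical ones. A possible remedy, sketched on a small inline CSV because the real file is not shown here: the rows below are hypothetical, and the assumption that volumes are written with thousands separators is a guess based on the output above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: volumes written with thousands separators
csv_text = (
    'Date,Open,Volume\n'
    '1/3/2017,778.81,"1,657,300"\n'
    '1/9/2017,806.40,"1,272,400"\n'
    '1/31/2017,796.86,"919,300"\n'
)

# Default loading keeps Date and Volume as text
as_text = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the Date column; thousands=',' lets Volume be read as integers
as_typed = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], thousands=',')

print(as_text['Volume'].max(), as_text['Date'].max())     # string comparison
print(as_typed['Volume'].max(), as_typed['Date'].max())   # numeric / chronological comparison
```

With typed columns, `max()` returns the largest volume and the latest date instead of the alphabetically last strings.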

In [113]:
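# mean of EACH numeric column
# (numeric_only = True skips text columns; without it, pandas 2.0+ may raise an error on them)

google_stock.mean(numeric_only = True)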

Open     807.5260
High     811.9265
Low      801.9495
Close    807.9045
dtype: float64

In [114]:
# minimum value in a specific column

google_stock['Open'].min()

778.81

### Another important data measure: data correlation

In [115]:
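# to get the correlation between the different numeric columns
# (numeric_only = True skips text columns; without it, pandas 2.0+ may raise an error on them)

google_stock.corr(numeric_only = True)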

Unnamed: 0,Open,High,Low,Close
Open,1.0,0.960636,0.972508,0.90769
High,0.960636,1.0,0.946877,0.947077
Low,0.972508,0.946877,1.0,0.95106
Close,0.90769,0.947077,0.95106,1.0
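
By default, `corr()` computes the Pearson correlation coefficient. As a sketch of what that means, the coefficient can be recomputed by hand on the first five Open/Close values shown earlier (so the number will differ from the full 20-row result above; `manual_r` and `pandas_r` are illustrative names):

```python
import numpy as np
import pandas as pd

# First five Open/Close values from the table displayed earlier
prices = pd.DataFrame({'Open': [778.81, 788.36, 786.08, 795.26, 806.40],
                       'Close': [786.14, 786.90, 794.02, 806.15, 806.65]})

# Pearson r: sum of products of deviations, divided by the product of their magnitudes
open_dev = prices['Open'] - prices['Open'].mean()
close_dev = prices['Close'] - prices['Close'].mean()
manual_r = (open_dev * close_dev).sum() / np.sqrt((open_dev ** 2).sum() * (close_dev ** 2).sum())

pandas_r = prices['Open'].corr(prices['Close'])
print(manual_r, pandas_r)
```

A value close to 1 means the two columns tend to rise and fall together, which is what the matrix above shows for all four price columns.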
