## <u> Introduction to Pandas </u>

  - **Introduction**
  
    - Pandas is a package for data manipulation and analysis in Python. The name Pandas is derived from the econometrics term Panel Data. 
    - Pandas incorporates two additional data structures into Python, namely Pandas Series and Pandas DataFrame. These data structures allow us to work with labeled and relational data in an easy and intuitive manner. 
    - [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)
    
  - **Why Use Pandas** 
    The recent success of machine learning algorithms is partly due to the huge amounts of data available to train them on. However, quantity is not the only thing that matters: the quality of your data is just as important. Large datasets rarely come ready to be fed into learning algorithms; more often than not they have missing values, outliers, incorrect values, and so on. Data with many missing or bad values will not allow your machine learning algorithms to perform well. Therefore, one very important step in machine learning is to look at your data first and make sure it is well suited to your training algorithm by doing some basic data analysis. This is where Pandas comes in. Pandas Series and DataFrames are designed for fast data analysis and manipulation, while being flexible and easy to use. Below are just a few features that make Pandas an excellent package for data analysis:
     
       - Allows the use of labels for rows and columns
       - Can calculate rolling statistics on time series data
       - Easy handling of NaN values
       - Is able to load data of different formats into DataFrames
       - Can join and merge different datasets together
       - It integrates with NumPy and Matplotlib


In [None]:
!conda list pandas

### <u> Creating Pandas Series </u>

 - A Pandas series is a **one-dimensional array-like object** that can hold many data types, such as numbers or strings. 
 
 
 - **Pandas Series VS NumPy ndarrays**
 
   - you can assign an index label to each element in the Pandas Series 
   
   - you can name the indices of your Pandas Series anything you want
   
   - you can hold data of different data types in the Pandas Series
   
   
 - You can create Pandas Series by using the command: 
 
   - `pd.Series(data, index)` where `index` is a list of index labels.
   - if no index is given, Pandas uses the default integer labels 0 to `len(data) - 1`
   - if an index is given, the data is indexed with the labels provided
   - in the example that follows, the data in the Pandas Series contains both integers and strings.
   
  
 - **=> Attributes for Pandas Series**: allow us to get information from the series in an easy way.
 
   - `series_name.shape` gives the shape (for a Series, a one-element tuple holding the number of elements)
   - `series_name.ndim` gives the number of dimensions (1 for a Series, 2 for a DataFrame)
   - `series_name.size` gives the total number of elements
   - `series_name.values` gives only the values of the elements
   - `series_name.index` gives only the index labels
   - `'sample' in series_name` checks whether an index label exists
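
The points above can be sketched with a small, hypothetical `temps` Series (the names and values are illustrative only and not part of the original notebook). It shows the default integer labels, the attributes listed above, and the fact that `in` tests index labels rather than values.

```python
import pandas as pd

# Without an index argument, pandas assigns integer labels 0..len(data)-1
default_idx = pd.Series(data=[1.5, 2.5, 3.5])
print(list(default_idx.index))   # [0, 1, 2]

# With an index argument, each element gets the label we chose
temps = pd.Series(data=[21.0, 19.5, 'n/a'], index=['mon', 'tue', 'wed'])

print(temps.shape)   # (3,)  a one-element tuple: the number of elements
print(temps.ndim)    # 1
print(temps.size)    # 3
print(temps.dtype)   # object, because numbers and strings are mixed

# `in` on a Series tests the index labels, not the values
print('tue' in temps)          # True
print(21.0 in temps)           # False
print(21.0 in temps.values)    # True
```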

In [None]:
# We import Pandas as pd into Python
import pandas as pd

# We create a Pandas Series that stores a grocery list
groceries = pd.Series(data = [30, 6, 'Yes', 'No'], index = ['eggs', 'apples', 'milk', 'bread'])
groceries_data = pd.Series(data = [30, 6, 'Yes', 'No'])
# We display the Groceries Pandas Series
groceries
groceries_data

In [None]:
# We print some information about Groceries
print('Groceries has shape:', groceries.shape)
print('Groceries has dimension:', groceries.ndim)
print('Groceries has a total of', groceries.size, 'elements')

In [None]:
# We print the index and data of Groceries
print('The data in Groceries is:', groceries.values)
print('The index of Groceries is:', groceries.index)

In [None]:
# We check whether * is a food item (an index) in Groceries
print('bananas' in groceries.index )
print('bread' in groceries ) #by default , it searches indices 
print(30 in groceries.values ) # could be used to search in elements

### <u> Arithmetic Operations on Pandas Series </u>

 - You can perform element-wise arithmetic operations on Pandas Series


 - You can also apply mathematical functions from NumPy, such as `sqrt(x)` to all elements of a Pandas Series.
 
 
 - Pandas also allows us to only apply arithmetic operations on selected items in our fruits grocery list
 
 
 - You can also apply arithmetic operations on Pandas Series of mixed data type provided that the arithmetic operation is defined for all data types in the Series, otherwise you will get an error. 
 
 
 - **=>Reminder! :**  The original series doesn't change after any of these operations; each one returns a new Series.
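
As a minimal sketch of the reminder above, the hypothetical `stock` Series below (an illustrative name, not from the original notebook) stays untouched by arithmetic until a result is assigned back.

```python
import numpy as np
import pandas as pd

stock = pd.Series(data=[10, 6, 3], index=['apples', 'oranges', 'bananas'])

doubled = stock * 2       # element-wise; returns a new Series
roots = np.sqrt(stock)    # NumPy functions also return a new Series

print(stock['apples'])    # 10: the original is untouched
print(doubled['apples'])  # 20

# To keep a result, assign it back (here for a single label)
stock['bananas'] = stock['bananas'] + 2
print(stock['bananas'])   # 5
```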


In [None]:
# We create a Pandas Series that stores a grocery list of just fruits
fruits= pd.Series(data = [10, 6, 3,], index = ['apples', 'oranges', 'bananas'])

# We perform basic element-wise operations using arithmetic symbols
print('Original grocery list of fruits:\n ', fruits)
print('fruits + 2:\n', fruits + 2) # We add 2 to each item in fruits
print('fruits - 2:\n', fruits - 2) # We subtract 2 from each item in fruits
print('fruits * 2:\n', fruits * 2) # We multiply each item in fruits by 2 
print('fruits / 2:\n', fruits / 2) # We divide each item in fruits by 2

In [None]:
# We import NumPy as np to be able to use the mathematical functions
import numpy as np
# We apply different mathematical functions to all elements of fruits
print('EXP(X) = \n', np.exp(fruits))
print('SQRT(X) =\n', np.sqrt(fruits))
print('POW(X,2) =\n',np.power(fruits,2)) # We raise all elements of fruits to the power of 2

In [None]:
import pandas as pd
fruits= pd.Series(data = [10, 6, 3,], index = ['apples', 'oranges', 'bananas'])
# We add 2 only to the bananas
print('Amount of bananas + 2 = ', fruits['bananas'] + 2)
# We subtract 2 from apples
print('Amount of apples - 2 = ', fruits.iloc[0] - 2)
# We multiply apples and oranges by 2
print('We double the amount of apples and oranges:\n', fruits[['apples', 'oranges']] * 2)
# We divide apples and oranges by 2
print('We halve the amount of apples and oranges:\n', fruits.loc[['apples', 'oranges']] / 2)


In [None]:
# Make sure the operation is valid for all the data types of your elements.
print(groceries)
# The multiplication operation * is defined for both numbers and strings,
# so the strings are repeated ('Yes' becomes 'YesYes').
print(groceries * 2)
# An operation that is valid for numbers but not for strings, such as /, raises an error.
try:
    print(groceries / 2)
except TypeError as err:
    print('TypeError:', err)
# So when you have mixed data types in your Pandas Series, make sure the arithmetic
# operation is defined for every data type in the Series.

### <u> Creating Pandas DataFrames - manually </u>


- Pandas DataFrames are **two-dimensional data structures with labeled rows and columns**, that can hold many data types.


- We can create Pandas DataFrames manually or by loading data from a file


- **Option1: Creating DataFrame from a dictionary of Pandas Series**
     - labels of rows and columns are displayed in bold
     - row labels of the DataFrame are built from the union of the index labels of the used Pandas Series 
     - column labels of the DataFrame are taken from the keys of the dictionary
     - in current Pandas versions the columns follow the order of the dictionary keys, as in the output below (older versions arranged them alphabetically)
     - `NaN` values appear in the DataFrame. `NaN` stands for `Not a Number`, and is Pandas' way of indicating that it doesn't have a value for that particular row and column index. (If we were to feed this data into a machine learning algorithm, we would have to remove these NaN values first.)
     - If we don't provide index labels to the Pandas Series, Pandas will use numerical row indexes when it creates the DataFrame.


- **Option2: Creating DataFrame from a dictionary of lists**

  - all the lists (arrays) in the dictionary must be of the same length (which was not the case for Series)
  
  - Pandas automatically uses numerical row indexes when it creates the DataFrame. We can, however, put labels on the row index by using the index keyword in the pd.DataFrame() function.


- **Option3: Creating DataFrame by using a list of dictionaries**

   - The procedure is the same as before: we start by creating the list of dictionaries and then pass it to the pd.DataFrame() function.
   
   - Pandas automatically uses numerical row indexes when it creates the DataFrame. We can, however, put labels on the row index by using the index keyword in the pd.DataFrame() function.
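
The length requirement for lists can be illustrated with a short sketch. The column names `a` and `b` are hypothetical. Series of different lengths are aligned on their index and padded with `NaN`, while plain lists of different lengths are rejected.

```python
import pandas as pd

# Series of different lengths are aligned on their index; gaps become NaN
from_series = pd.DataFrame({'a': pd.Series([1, 2, 3]),
                            'b': pd.Series([10, 20])})
print(from_series.shape)                # (3, 2)
print(from_series['b'].isnull().sum())  # 1 missing value in column 'b'

# Plain lists carry no index to align on, so unequal lengths are rejected
try:
    pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20]})
    error_raised = False
except ValueError as err:
    error_raised = True
    print('ValueError:', err)
```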


In [26]:
# We import Pandas as pd into Python
import pandas as pd

# We create a dictionary of Pandas Series 
items = {'Bob' : pd.Series(data = [245, 25, 55], index = ['bike', 'pants', 'watch']),
         'Alice' : pd.Series(data = [40, 110, 500, 45], index = ['book', 'glasses', 'bike', 'pants'])}

# We print the type of items to see that it is a dictionary
#print(type(items))
#print(items)

# We create a Pandas DataFrame by passing it a dictionary of Pandas Series
shopping_carts = pd.DataFrame(items)

# We display the DataFrame
shopping_carts

| | Bob | Alice |
|---|---|---|
| bike | 245.0 | 500.0 |
| book | NaN | 40.0 |
| glasses | NaN | 110.0 |
| pants | 25.0 | 45.0 |
| watch | 55.0 | NaN |


In [None]:
# We create a dictionary of Pandas Series without indexes
items_no_index = {'Bob' : pd.Series([245, 25, 55]),
        'Alice' : pd.Series([40, 110, 500, 45])}

# We create a DataFrame
df = pd.DataFrame(items_no_index)

# We display the DataFrame
df

In [None]:
# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

# We create a DataFrame 
df = pd.DataFrame(data)

# We display the DataFrame
df

In [None]:
# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

# We create a DataFrame and provide the row index
df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])

# We display the DataFrame
df

In [27]:
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# We create a DataFrame 
store_items = pd.DataFrame(items2)

# We display the DataFrame
store_items

| | bikes | glasses | pants | watches |
|---|---|---|---|---|
| 0 | 20 | NaN | 30 | 35 |
| 1 | 15 | 50.0 | 5 | 10 |


In [None]:
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# We create a DataFrame  and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])

# We display the DataFrame
store_items

In [None]:
print(items)
print()
print(items_no_index)
print()
print(data)
print(items2)

**=> Attributes for DataFrames** 

- Pandas DataFrames have attributes that allow us to get information from the DataFrame in an easy way.
 
   - `dataframe_name.shape` gives the shape as (number of rows, number of columns) 
   - `dataframe_name.ndim` gives the number of dimensions (2 for a DataFrame) 
   - `dataframe_name.size` gives the total number of elements
   - `dataframe_name.values` gives the values of the elements
   - `dataframe_name.index` gives the row index labels
   - `dataframe_name.columns` gives the column labels


**=> Retrieving a subset of data** 

- Retrieve a subset of the data instead of the entire dictionary by using the keywords `columns` and `index` 

   - **`subset_of_frame = pd.DataFrame(dict_name, columns=['...']) `**
   - **`subset_of_frame = pd.DataFrame(dict_name, index=['...']) `**
   - **`subset_of_frame = pd.DataFrame(dict_name, columns=['...'],index=['...']) `**


In [None]:
# We print some information about shopping_carts
print('shopping_carts has shape:', shopping_carts.shape)
print('shopping_carts has dimension:', shopping_carts.ndim)
print('shopping_carts has a total of:', shopping_carts.size, 'elements')
print()
print('The data in shopping_carts is:\n', shopping_carts.values)
print()
print('The row index in shopping_carts is:', shopping_carts.index)
print()
print('The column index in shopping_carts is:', shopping_carts.columns)

In [None]:
# We Create a DataFrame that only has Bob's data
bob_shopping_cart = pd.DataFrame(items, columns=['Bob'])

# We display bob_shopping_cart
bob_shopping_cart

# We Create a DataFrame that only has selected items for both Alice and Bob
sel_shopping_cart = pd.DataFrame(items, index = ['pants', 'book'])

# We display sel_shopping_cart
sel_shopping_cart

# We Create a DataFrame that only has selected items for Alice
alice_sel_shopping_cart = pd.DataFrame(items, index = ['glasses', 'bike'], columns = ['Alice'])

# We display alice_sel_shopping_cart
alice_sel_shopping_cart

### Accessing Elements in Pandas DataFrames 

- In general, we can access rows, columns, or individual elements of the DataFrame by using the row and column labels

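- when chaining square brackets, the column label must always come first, i.e. `dataframe[column][row]` 

- If you provide the row label first this way, you will get an error (a `KeyError`), because plain square brackets look up column labels. To select by row label, use `dataframe.loc[row]`.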
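
A minimal sketch of the label order, using a hypothetical two-store `inventory` DataFrame (illustrative values only):

```python
import pandas as pd

inventory = pd.DataFrame({'bikes': [20, 15], 'pants': [30, 5]},
                         index=['store 1', 'store 2'])

# Chained brackets: column label first, then row label
print(inventory['bikes']['store 2'])      # 15

# .loc takes the row label first, then the column label
print(inventory.loc['store 2', 'bikes'])  # 15

# Plain brackets look up columns, so a row label raises KeyError
try:
    inventory['store 2']['bikes']
    row_first_failed = False
except KeyError:
    row_first_failed = True
print(row_first_failed)                   # True
```

For reading or writing a single element, `.loc[row, column]` is usually preferable to chained brackets, since it does the lookup in one step.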


In [None]:
# We print the store_items DataFrame
print(store_items)

# We access rows, columns and elements using labels
print()
print('How many bikes are in each store:\n', store_items[['bikes']])
print()
print('How many bikes and pants are in each store:\n', store_items[['bikes', 'pants']])
print()
print('What items are in Store 1:\n', store_items.loc[['store 1']])
print()
print('How many bikes are in Store 2:', store_items['bikes']['store 2'])

### Modifying Elements in Pandas DataFrames 

- **=>Add a new column** 

  - To add new columns to our DataFrames :
  
    - `dataframe_name['new_column']=[new_values] `
    
  - The new column is added at the end of our DataFrame
  
  - Using arithmetic operations between other columns in our DataFrame :
  
    - `dataframe_name['new_column'] = dataframe_name['column1'] + dataframe_name['column2']` 
     
  - By using only data from particular rows in a particular column (rows that are left out get `NaN`)    
  
     - `dataframe_name['new_column'] = dataframe_name['existing_column'][1:]`
        
  - The `dataframe.insert(loc, label, data)` method allows us to insert a new column in the dataframe at location `loc`, with the given column `label`, and given `data`. 
    
     - `dataframe_name.insert(location, 'column_name', values)` 
    
    
        
- **=>Add a new row** 

    - To add rows to our DataFrame we first have to create a new DataFrame and then concatenate it with the original DataFrame 
    
      - create a list of Python dictionaries with the new data
      - create a new DataFrame from it 
      - join it to the existing DataFrame with `pd.concat([dataframe_name, new_frame])` (the older `dataframe_name.append(new_frame)` was removed in pandas 2.0)
      - use `sort=True` to put the columns in alphabetical order.


- **=>Delete rows and columns** 

  - `.pop()` and `.drop()` methods are used to delete rows and columns from our DataFrame.
   
  - `.pop()` method only allows us to delete columns
       
    - `dataframe_name.pop('column_name')`
      
  - `.drop()` method can be used to delete both rows and columns by use of the axis keyword. 
   
    - axis = 1 refers to columns -  vertical lines  
    - axis = 0 refers to rows - horizontal lines
    - `dataframe_name = dataframe_name.drop(['column_name1', 'column_name2'], axis = 1)` 
    - `dataframe_name = dataframe_name.drop(['index_name1', 'index_name2'], axis = 0)` 
  
  
- **=>Change the labels**

  - using the `.rename()` method
    
    - `dataframe_name = dataframe_name.rename(columns = {'column_name': 'new_name'})`
    
    - `dataframe_name = dataframe_name.rename(index = {'index_name': 'new_name'})`
 
  - changing the index as one of the columns in the DataFrame is also possible  
 
    - `dataframe_name = dataframe_name.set_index('column_name')`
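
The operations above can be sketched end to end on a small, hypothetical `shop` DataFrame (names and values are illustrative). The new row is added with `pd.concat`, which is an assumption about the installed version: `DataFrame.append` is no longer available from pandas 2.0 onwards.

```python
import pandas as pd

shop = pd.DataFrame({'bikes': [20, 15], 'pants': [30, 5]},
                    index=['store 1', 'store 2'])

# New column from existing ones, then a column inserted at position 0
shop['total'] = shop['bikes'] + shop['pants']
shop.insert(0, 'shoes', [8, 5])

# New row: build a one-row DataFrame and concatenate it
new_row = pd.DataFrame([{'bikes': 10, 'pants': 12}], index=['store 3'])
shop = pd.concat([shop, new_row], sort=True)  # sort=True orders columns alphabetically

# pop removes a column in place and returns it; drop returns a new DataFrame
totals = shop.pop('total')
shop = shop.drop(['store 1'], axis=0)
shop = shop.rename(columns={'bikes': 'hats'})

print(list(shop.columns))  # ['hats', 'pants', 'shoes']
print(list(shop.index))    # ['store 2', 'store 3']
```

Store 3 has no `shoes` or `total` value in the new row, so those cells are filled with `NaN` after the concatenation.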
     

In [None]:
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# We create a DataFrame  and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])

# We display the DataFrame
store_items

In [None]:
# We add a new column named shirts to our store_items DataFrame indicating the number of
# shirts in stock at each store. We will put 15 shirts in store 1 and 2 shirts in store 2
store_items['shirts'] = [15,2]

# We display the modified DataFrame
store_items

In [None]:
# We make a new column called suits by adding the number of shirts and pants
store_items['suits'] = store_items['pants'] + store_items['shirts']

# We display the modified DataFrame
store_items

In [None]:
# We create a list of Python dictionaries that holds the number of items at the new store
new_items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]

# We create a new DataFrame with the new_items and provide an index labeled store 3
new_store = pd.DataFrame(new_items,index = ['store 3'])
#new_store = pd.DataFrame(new_items)

# We display the items at the new store
new_store

In [None]:
# We add store 3 to our store_items DataFrame. pd.concat is used because
# DataFrame.append was removed in pandas 2.0
store_items = pd.concat([store_items, new_store], sort=True)

# We display the modified DataFrame
store_items

In [None]:
# We add a new column using data from particular rows in the watches column
store_items['new watches'] = store_items['watches'][1:]

# We display the modified DataFrame
store_items

In [None]:
# We insert a new column with label shoes right before the column with numerical index 4
store_items.insert(4, 'shoes', [8,5,0])

# we display the modified DataFrame
store_items

In [None]:
# We remove the new watches column
store_items.pop('new watches')

# we display the modified DataFrame
store_items

In [None]:
# We remove the watches and shoes columns
store_items = store_items.drop(['watches', 'shoes'], axis = 1)

# we display the modified DataFrame
store_items

In [None]:
# We remove the store 2 and store 1 rows
store_items = store_items.drop(['store 2', 'store 1'], axis = 0)

# we display the modified DataFrame
store_items

In [None]:
# We change the column label bikes to hats
store_items = store_items.rename(columns = {'bikes': 'hats'})

# we display the modified DataFrame
store_items

In [None]:
# We change the row label from store 3 to last store
store_items = store_items.rename(index = {'store 3': 'last store'})

# we display the modified DataFrame
store_items

In [None]:
# We change the row index to be the data in the pants column
store_items = store_items.set_index('pants')

# we display the modified DataFrame
store_items

### Dealing with NaN

- We need to have a method for detecting and correcting errors in our data. While any given dataset can have many types of bad data, such as outliers or incorrect values, the type we encounter most often is missing values. Pandas assigns NaN values to missing data.


- **Combine `.isnull()` and `.sum()` to count NaN values** 
  
  - the `.isnull()` method returns a Boolean DataFrame of the same size
  - `x =  data_frame_name.isnull()` marks NaN values as True (numerical value 1) and all other values as False (numerical value 0)
  - `x =  data_frame_name.isnull().any()` shows which columns contain NaNs
  - `x =  data_frame_name.isnull().sum()` counts NaNs along columns
  - `x =  data_frame_name.isnull().sum().sum()` counts total NaNs


- **Count the number of non-NaN values** 

  - `data_frame_name.count()` to count non-NaNs along columns
  
  
- **Eliminate rows or columns that contain any NaN values**

  - `.dropna(axis)` method eliminates any rows with NaN values when axis = 0 is used 
  - `.dropna(axis)` method eliminates any columns with NaN values when axis = 1 is used
  - The original DataFrame is not modified. 
  - You can remove the desired rows or columns in place by setting the keyword `inplace = True` inside `dropna()`.


- **Replace NaN values**

  - **with value**  
    - `data_frame_name.fillna(value) `
    
  - **forward filling**
    - `data_frame_name.ffill(axis = 0)` fills with the previous value in the column
    - `data_frame_name.ffill(axis = 1)` fills with the previous value in the row
    - older code writes this as `data_frame_name.fillna(method = 'ffill', axis = 0)`; the `method` keyword of `fillna()` is deprecated in recent Pandas versions
  
  - **backward filling**
    - `data_frame_name.bfill(axis = 0)` fills with the next value in the column
    - `data_frame_name.bfill(axis = 1)` fills with the next value in the row
    - older code writes this as `data_frame_name.fillna(method = 'backfill', axis = 0)`
  
  - **interpolation methods**
    - will use linear interpolation to replace NaN values using the values along the given axis. 
    - `data_frame_name.interpolate(method = 'linear', axis=0)` replaces NaN values by using column values
    - `data_frame_name.interpolate(method = 'linear', axis=1)` replaces NaN values by using row values
  
  - The original DataFrame is not modified. 
  
  - You can replace the NaN values in place by setting the keyword `inplace = True` inside the `fillna()` function.
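
A compact sketch of detecting and replacing NaN values on a hypothetical `ratings` DataFrame (illustrative values only). Forward filling is written with the `.ffill()` method here.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                        'b': [4.0, 5.0, np.nan]})

print(ratings.isnull().any())        # per column: does it contain a NaN?
print(ratings.isnull().sum().sum())  # 2 NaN values in total
print(ratings.count())               # non-NaN values per column

filled_zero = ratings.fillna(0)      # constant value
filled_fwd = ratings.ffill(axis=0)   # previous value in the column
filled_lin = ratings.interpolate(method='linear', axis=0)

print(filled_fwd.loc[1, 'a'])        # 1.0
print(filled_lin.loc[1, 'a'])        # 2.0
print(ratings.isnull().sum().sum())  # still 2: the original is unchanged
```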
  


In [None]:
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]

# We create a DataFrame  and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

# We display the DataFrame
store_items

In [None]:
# We count the number of NaN values in store_items
x =  store_items.isnull().sum().sum()

# We print x
print('Number of NaN values in our DataFrame:', x)

In [None]:
# We print the number of non-NaN values in our DataFrame
print()
print('Number of non-NaN values in the columns of our DataFrame:\n', store_items.count())

In [None]:
# We drop any rows with NaN values
store_items.dropna(axis = 0)

In [None]:
# We drop any columns with NaN values
store_items.dropna(axis = 1)

In [None]:
# We replace all NaN values with 0
store_items.fillna(0)

In [None]:
# We replace NaN values with the previous value in the column
# (equivalent to the older fillna(method = 'ffill', axis = 0), whose method keyword is deprecated)
store_items.ffill(axis = 0)

In [None]:
# We replace NaN values with the previous value in the row
store_items.ffill(axis = 1)

In [None]:
# We replace NaN values with the next value in the column
store_items.bfill(axis = 0)

In [None]:
# We replace NaN values with the next value in the row
store_items.bfill(axis = 1)

In [None]:
store_items

In [None]:
# We replace NaN values by using linear interpolation using column values
store_items.interpolate(method = 'linear', axis = 0)

In [None]:
# We replace NaN values by using linear interpolation using row values
store_items.interpolate(method = 'linear', axis = 1)

In [None]:
import pandas as pd
import numpy as np

# Since we will be working with ratings, we will set the display precision of our 
# dataframes to one decimal place.
pd.set_option('display.precision', 1)

# Create a Pandas DataFrame that contains the ratings some users have given to a
# series of books. The ratings given are in the range from 1 to 5, with 5 being
# the best score. The names of the books, the authors, and the ratings of each user
# are given below:

books = pd.Series(data = ['Great Expectations', 'Of Mice and Men', 'Romeo and Juliet', 'The Time Machine', 'Alice in Wonderland' ])
authors = pd.Series(data = ['Charles Dickens', 'John Steinbeck', 'William Shakespeare', ' H. G. Wells', 'Lewis Carroll' ])

user_1 = pd.Series(data = [3.2, np.nan ,2.5])
user_2 = pd.Series(data = [5., 1.3, 4.0, 3.8])
user_3 = pd.Series(data = [2.0, 2.3, np.nan, 4])
user_4 = pd.Series(data = [4, 3.5, 4, 5, 4.2])

# An np.nan value means that the user has not yet rated that book.
# Use the data above to create a Pandas DataFrame that has the following column
# labels: 'Author', 'Book Title', 'User 1', 'User 2', 'User 3', 'User 4'. Let Pandas
# automatically assign numerical row indices to the DataFrame. 

# Create a dictionary with the data given above
dat = {'Author' : authors,
         'Book Title' :books,
         'User 1' : user_1,
         'User 2' : user_2,
         'User 3' : user_3,
         'User 4' : user_4 
        }

# Use the dictionary to create a Pandas DataFrame
book_ratings = pd.DataFrame(dat)



# If you created the dictionary correctly you should have a Pandas DataFrame
# that has column labels: 'Author', 'Book Title', 'User 1', 'User 2', 'User 3',
# 'User 4' and row indices 0 through 4.

# Now replace all the NaN values in your DataFrame with the average rating in
# each column. Replace the NaN values in place. HINT: you can use the fillna()
# function with the keyword inplace = True, to do this. Write your code below:

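# numeric_only=True restricts the mean to the rating columns and skips the text columns
book_ratings.fillna(book_ratings.mean(numeric_only=True), inplace=True)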
                       

In [None]:
book_ratings

In [None]:
book_ratings.fillna(book_ratings.mean(numeric_only=True), inplace=True)
book_ratings  

In [None]:
# best_rated = book_ratings[(book_ratings == 5).any(axis = 1)]['Book Title'].values
ratings_top = book_ratings[(book_ratings == 5).any(axis = 1)]
#ratings_only
ratings_top['Book Title']
ratings_top['Book Title'].values

In [None]:
ratings_best = book_ratings[book_ratings==5]
ratings_best

In [None]:
#store_items = store_items.set_index('pants')
book_ratings = book_ratings.set_index('Book Title')

In [None]:
book_ratings

In [None]:
ratings_best.dropna(axis = 0)

### <u> Creating Pandas DataFrames - loading data from a file </u>


- Pandas DataFrames are two-dimensional data structures with labeled rows and columns, that can hold many data types.


- We can create Pandas DataFrames manually or by loading data from a file; here we load a CSV file with `pd.read_csv()`


- to take a look at the first or last **few rows** of large data sets :
  - `dataframe_name.head(N)` shows the first `N` rows
  - `dataframe_name.tail(N)` shows the last `N` rows
  - `N` is the number of rows to be listed; the default is 5 
  
  
- to check whether any of the columns contain **NaN values**: 
  - `dataframe_name.isnull().any()`
  
  
- to get **descriptive statistics** on each column of the DataFrame
  - `dataframe_name.describe()` 
    - displays count, mean, std, min, quartiles and max 
  - `dataframe_name['column name'].describe()`
  - `dataframe_name.max()` 
  - `dataframe_name.mean()` 
  - `dataframe_name['column name'].min()`
  - `dataframe_name.corr()` pairwise correlation between columns
    - A correlation value of 1 tells us there is a perfect positive linear correlation, a value of -1 a perfect negative one, and a correlation of 0 tells us that the data is not linearly correlated at all.
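
The inspection methods above can be tried on a small, hypothetical `prices` DataFrame built in memory (the column names and values are illustrative and unrelated to the stock file loaded in this section):

```python
import pandas as pd

prices = pd.DataFrame({'open':   [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                       'close':  [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
                       'volume': [60, 50, 40, 30, 20, 10]})

print(prices.head(2))              # first 2 rows
print(prices.tail())               # last 5 rows (the default)
print(prices.isnull().any())       # False for every column here
print(prices['close'].describe())  # count, mean, std, min, quartiles, max

corr = prices.corr()
print(corr.loc['open', 'close'])   # close to 1: perfect positive linear relation
print(corr.loc['open', 'volume'])  # close to -1: perfect negative linear relation
```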



In [None]:
import pandas as pd
# We load Google stock data in a DataFrame
Google_stock = pd.read_csv('./goog-1.csv')

# We print some information about Google_stock
print('Google_stock is of type:', type(Google_stock))
print('Google_stock has shape:', Google_stock.shape)
Google_stock

In [3]:
Google_stock['Adj Close'].describe()

count    3313.000000
mean      380.072458
std       223.853780
min        49.681866
25%       226.407440
50%       293.029114
75%       536.690002
max       989.679993
Name: Adj Close, dtype: float64

In [4]:
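# numeric_only=True skips any non-numeric columns; recent Pandas versions
# raise an error if such columns are present and it is not set
Google_stock.corr(numeric_only=True)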

| | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| Open | 1.0 | 0.999904 | 0.999845 | 0.999745 | 0.999745 | -0.564258 |
| High | 0.999904 | 1.0 | 0.999834 | 0.999868 | 0.999868 | -0.562749 |
| Low | 0.999845 | 0.999834 | 1.0 | 0.999899 | 0.999899 | -0.567007 |
| Close | 0.999745 | 0.999868 | 0.999899 | 1.0 | 1.0 | -0.564967 |
| Adj Close | 0.999745 | 0.999868 | 0.999899 | 1.0 | 1.0 | -0.564967 |
| Volume | -0.564258 | -0.562749 | -0.567007 | -0.564967 | -0.564967 | 1.0 |


In [8]:
Google_stock['High'].corr(Google_stock['Low'])

0.9998336413954032

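### GROUPBY
 - `.groupby()` splits the rows into groups that share the same value in one or more columns, so that an aggregation such as `.sum()` or `.mean()` can be applied to each group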
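
As a self-contained sketch, the DataFrame below (named `staff` here for illustration) rebuilds the `Year`, `Name`, `Department` and `Salary` columns of the company table displayed in this section, so the grouping can be run without the CSV file.

```python
import pandas as pd

staff = pd.DataFrame({
    'Year':       [1990, 1990, 1990, 1991, 1991, 1991, 1992, 1992, 1992],
    'Name':       ['Alice', 'Bob', 'Charlie'] * 3,
    'Department': ['HR', 'RD', 'Admin', 'HR', 'RD', 'Admin', 'Admin', 'RD', 'Admin'],
    'Salary':     [50000, 48000, 55000, 52000, 50000, 60000, 60000, 52000, 62000],
})

# Split the rows by Year, then sum the Salary column within each group
per_year = staff.groupby('Year')['Salary'].sum()
print(per_year.loc[1990])                    # 153000

# Grouping by several columns yields a result with a two-level index
per_year_dept = staff.groupby(['Year', 'Department'])['Salary'].sum()
print(per_year_dept.loc[(1992, 'Admin')])    # 122000
```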

In [17]:
# We load fake Company data in a DataFrame
data = pd.read_csv('./fake_company.csv')

data

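| | Year | Name | Department | Age | Salary |
|---|---|---|---|---|---|
| 0 | 1990 | Alice | HR | 25 | 50000 |
| 1 | 1990 | Bob | RD | 30 | 48000 |
| 2 | 1990 | Charlie | Admin | 45 | 55000 |
| 3 | 1991 | Alice | HR | 26 | 52000 |
| 4 | 1991 | Bob | RD | 31 | 50000 |
| 5 | 1991 | Charlie | Admin | 47 | 60000 |
| 6 | 1992 | Alice | Admin | 27 | 60000 |
| 7 | 1992 | Bob | RD | 32 | 52000 |
| 8 | 1992 | Charlie | Admin | 47 | 62000 |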


In [18]:
# We display the total amount of money spent in salaries each year
data.groupby(['Year'])['Salary'].sum()

Year
1990    153000
1991    162000
1992    174000
Name: Salary, dtype: int64

In [20]:
# We display the average salary per year
data.groupby(['Year'])['Salary'].mean()

Year
1990    51000
1991    54000
1992    58000
Name: Salary, dtype: int64

In [21]:
# We display the total salary each employee received in all the years they worked for the company
data.groupby(['Name'])['Salary'].sum()

Name
Alice      162000
Bob        150000
Charlie    177000
Name: Salary, dtype: int64

In [22]:
# We display the salary distribution per department per year.
data.groupby(['Year', 'Department'])['Salary'].sum()

Year  Department
1990  Admin          55000
      HR             50000
      RD             48000
1991  Admin          60000
      HR             52000
      RD             50000
1992  Admin         122000
      RD             52000
Name: Salary, dtype: int64

In [23]:
data.groupby(['Department','Year'])['Salary'].sum()

Department  Year
Admin       1990     55000
            1991     60000
            1992    122000
HR          1990     50000
            1991     52000
RD          1990     48000
            1991     50000
            1992     52000
Name: Salary, dtype: int64