- Here is a page for [Markdown Cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)
- Here is a page for [Python Library Documentation](https://docs.python.org/3/library/)

**_Quick tips:_**
* Variables are stored in the kernel (the backend engine that runs the code). If the kernel restarts, all variables are lost and have to be reinitialized.
* In command mode (blue border), L toggles line numbers in the code cell.
* In command mode, M and Y switch the cell to markdown and code mode, respectively.
* To comment out multiple lines, select them and press Ctrl + /.
* To have no output when you open the file, use Kernel > Restart & Clear Output, or Cell > All Output > Clear.
* To toggle between showing and hiding a cell's output, press O (I changed this to be the Clear Cell Output shortcut).
* To edit keyboard shortcuts, press H and assign or edit.
* To bring up the command palette, press P, then type what you want and run it.

1. Numerical/quantitative variables divided into continuous (floats) and discrete (integers)
2. Categorical variables are nominal (boolean, string, None) or ordinal (e.g. lists, since they have indices)
3. Mean of a list of boolean variables will count the number of True (1) values and divide by the number of items!
4. Shift + L toggles line numbers in all cells, so you can see how many lines are written in each cell. Pressing L toggles line numbers for the selected cell only.
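
Point 3 can be checked directly. A minimal sketch, where `flags` is a made-up list used only for illustration:

```python
# True counts as 1 and False as 0, so the mean of booleans is the share of True values
flags = [True, False, True, True]   # hypothetical example values

n_true = sum(flags)                 # number of True items
share_true = n_true / len(flags)    # proportion of True items

print(n_true, share_true)
```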

In [None]:
pip list # to see a list of the libraries installed on your machine

In [None]:
# importing some libraries that will be needed
import numpy as np   # for computation purposes with tools for arrays and numerical data
import pandas as pd  # for data manipulation and analysis with easy-to-use data structures and data analysis tools
import math          # for various mathematical functions and constants for numerical computations
import sys
import string        # for string manipulation functions such as concatenation, case conversion, trimming, and search
import seaborn as sns # to create visually appealing and informative statistical graphics 
import matplotlib.pyplot as plt # for creating data visualizations in the form of charts, graphs, and plots
# another way to write the above is from matplotlib import pyplot as plt

# to display matplotlib plots directly in the notebook output cell (keep comments off the magic line itself)
%matplotlib inline

In [7]:
from pydataset import data # to use the data() function to load datasets (available in R) in python and do things!

In [None]:
pd.options.display.max_columns = None # to remove the limit on the number of df columns shown in output
pd.options.display.precision = 3 # to have 3 decimal points as a global option in dataframes
np.set_printoptions(precision = 3) # to have 3 decimal points as a global option in simple print outputs
np.set_printoptions(suppress = True) # to avoid scientific notation in NumPy outputs (does not affect numbers in pandas dataframes)

This is just my pet peeve! The line below sets a global option to show 3 decimals for all floats in pandas dataframes, and it applies to all dataframes created or shown afterwards. This is useful when a new column is created and it shows in scientific format! Otherwise, `pd.options.display.precision = 3` does the job for existing dfs.

In [None]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# to revert the above: use pd.set_option('display.float_format', None)
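
To try the float format without touching the global setting, `pd.option_context` can apply it temporarily. This is only a sketch: `demo` and its values are made up for illustration.

```python
import pandas as pd

# toy frame (made-up values): one tiny float and one large float
demo = pd.DataFrame({"tiny": [1.23456e-7], "big": [12345.678912]})

default_text = demo.to_string()     # uses whatever display options are currently set

# option_context applies the option only inside the with-block
with pd.option_context("display.float_format", "{:.3f}".format):
    fixed_text = demo.to_string()

print(default_text)
print(fixed_text)
```

Outside the `with` block the previous display settings are restored automatically.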

# Some General Functions

In [None]:
sys.version # to check the version of the python we're running

In [None]:
# %whos # to see what objects/variables, etc. we have in memory - an IPython magic, so it works in IPython and Jupyter but not in the plain Python shell
dir() # returns a list of names in the current local scope or a specified object.

In [None]:
n = [1, 3, 6, 8, 13, 1, 4]
print(type(n))
sum(n) # sum of the numbers in the list

In [None]:
len(n) # length of the list

In [None]:
x = round(math.pi, 4)  # the number pi from the math library
print(x) ; del x  # del removes x from memory; calling x after this raises a NameError

In [None]:
str(math.pi) # to convert data type to string

In [None]:
l = ([True, 6 == 5, 4 > 3, None is None, 7]) 
print(l)
sum(l)/len(l) # there are 3 true values and a 7 totalling 10 divided by 5 gets 2 (as float)

In [None]:
l.sort() # to sort the items inside a list in place, which changes the ordering of the list itself
l

In [None]:
np.mean(l)

In [None]:
[None]*5 # to create a list containing 5 None items
type(None)

In [None]:
a = np.array([0,1,2,3,4,5,6,7,8,9,10])
type(a)

# Data Types
In Python, there are several built-in data types: 
1. Numeric: int, float, complex
2. Sequences: list, tuple, range
3. Text: str
4. Mapping: dict
5. Set: set, frozenset
6. Boolean: bool
7. Binary: bytes, bytearray
8. None: NoneType

In addition to these built-in data types, Python also supports ***custom data types***, such as **classes** and **objects**.
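
As a rough illustration of a custom data type, here is a minimal sketch of a class and an object created from it. The class `Point` and its method are hypothetical names chosen for this example.

```python
class Point:
    """A hypothetical custom data type holding two coordinates."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def distance_from_origin(self):
        # Euclidean distance computed from the stored coordinates
        return (self.x ** 2 + self.y ** 2) ** 0.5

p = Point(3, 4)                  # p is an object (instance) of the class Point
print(type(p).__name__)          # the type of p is the class we defined
print(p.distance_from_origin())
```
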
Let's Look at the Sequences:

## Strings

In [None]:
txt = "She said \"Never let go\"."  # Backslashes (\) are used to escape characters
print(txt)
print("length of txt is " + str(len(txt)) + " characters")
print("m" in txt)

In [None]:
txt  = "Gishar"
print(txt[1] + ", " + txt[-1] + ", " + txt[4:6] + ", " + txt[:4] + ", " + txt[-3:])

In [None]:
for letter in txt:
    print(letter)

In [None]:
txt.lower() + ' ' + txt.upper()  # to make string all lower letter or all capital letters

In [None]:
("   There were three chars first and 5 at the end!     ").strip() # strip removes leading and trailing whitespace (or the characters passed to it)

In [None]:
("   There were three chars first and 5 at the end!     ").strip().title() # title returns the string with the first letter of every word capitalized

In [None]:
txt.split('s') # use a character to split text

In [None]:
# practice with split method and set function! One "the" & "dove" remained. if upper was not used, the and The were there!
set("The dove dove into the water".upper().split())

In [None]:
# continue from here: https://www.codecademy.com/learn/learn-python-3/modules/learn-python3-strings/cheatsheet

## Integer, Float, Boolean

In [1]:
x = 1 ; y = 2.3
type(x) # type to see the type of an object

int

In [None]:
float(x) # to convert to float

In [None]:
float(True) # convert boolean to float, which gives 0.0 for False and 1.0 for True

In [None]:
int(y) # to convert to integer

In [None]:
str(y) # to convert to string

In [None]:
print(1>3) # print the logical result of operation
bool(0) # converting 0 to boolean gives False; any other number gives True

## Lists
Lists in Python are a collection of items, ordered and mutable. A list can contain items of different data types, such as integers, floating-point numbers, strings, and so on. Lists are defined using square brackets, with the items separated by commas. 

Lists allow you to store, manipulate and retrieve elements efficiently, making them one of the most commonly used data structures in Python. You can access individual elements using indexing, add elements to a list using the _append_ method, remove elements using the _remove_ method, and _sort_ the list using the sort method, among other operations.

In [None]:
heights = [["Noelle", 61], ["Ava", 70], ["Sam", 67], ["Mia", 64]] # 2D list is a good structure for representing grids
myclass = [
        ["Kenny", "American", 9],
        ["Tanya", "Ukrainian", 9],
        ["Madison", "Indian", 7]
        ]
print(heights[2][1]) # to call an item in a 2D list we use two [], first to call the list index, second to call the item index in the list
print(myclass[1]) # print the whole list in the 2D list called by its index
print(myclass[-2][-2])

In [None]:
myclass[1][1] = "Albanian" # same index as [-2][-2]
print(myclass[-2][-2])

In [None]:
myclass[1].remove("Tanya") # to remove or append an item from/to a list inside a 2D list, apply the function on the specific list inside!
myclass

In [None]:
myclass.sort() # to sort the list (based on the first elements of each list inside a 2D list)
myclass

In [None]:
x = [] # to create an empty list in an implicit way
x = list() # to create an empty list in an explicit way

In [None]:
x = [3, 1, 20, 2, 2, 3, 5, "hi"]  # a list can contain anything
print(x[1]) # to see the item on index 1 (indices on lists start from 0)
print(x[2:6]) # to see the items on index 2 to 5 (last number exclusive)

In [None]:
x.count(3)

In [None]:
x.append(2)
x

In [None]:
# different ways to update or extract info from a list
x.append([True]) # Appends the item to the end of the list
x.append(True) # Appends the item to the end of the list
x

In [None]:
x[1:2]=["Two", "three"]
x += [23, "No", False] # another way to add to a list other than using append which only adds 1 item. we can use extend too
x

In [None]:
# [None] is a list with 1 null (no-value) item. None is not the same as 0, False, or an empty string. None is a data type of its own (NoneType) and only None can be None.
x.append([None]*5)

In [None]:
x.remove(2) # removes the item from the list the first time it shows up in the list (for removing by index, use pop() method)

In [None]:
x = x[0:5]

In [None]:
x.extend([6, "Yes", False])  # to append more than 1 item, we extend: equivalent to concatenating two lists

In [None]:
x.insert(0, 3) # to insert a new entry into a list using index: .insert(index, value)
x.insert(-4, "text this time")

In [None]:
x.pop(1) # to remove an item from a list using index: .pop(index) - without an index it simply removes the last one
x

In [None]:
r = x.pop(1) # we can save the removed value to a variable if we care to use it later
print(r)  # stores the removed (popped) item into another variable
print(x)

In [None]:
name = list("GISHAR")
print("name = ", name) # this is what list function does to strings.

In [None]:
name.sort() # this is a method to sort the original list. after this is done, if you print the list, you'll see the sorted version
print("sorted name = ", name)

In [None]:
name.sort(reverse=True)
print("reverse sorted name = ", name)

### Numpy Arrays
NumPy arrays are **multi-dimensional arrays** used in numerical computing with Python. They allow for fast computation and manipulation of large arrays of data. They support a variety of operations including *element-wise operations, slicing, and indexing*. **NumPy arrays are more efficient than regular Python lists for large data sets and are widely used in data analysis and scientific computing**.

Let's look at these common tools from np:

`np.array()`, `np.arange()`, `np.zeros()`, `np.ones()`, `np.linspace()`, `np.eye()`,

`np.random.rand()`, `np.random.randn()`, `np.random.randint()`,

`.reshape()`, `.shape`, `.dtype`, `.max()`, `.argmax()`, `.min()`, `.argmin()`
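
Before going through these one by one, here is a small sketch of what element-wise operations, slicing, and indexing mean in practice, compared with a plain list. The numbers are made up for illustration.

```python
import numpy as np

values = [1, 2, 3, 4]           # made-up numbers
arr = np.array(values)

doubled_list = values * 2       # list repetition: the whole list is repeated twice
doubled_arr = arr * 2           # element-wise: each element is multiplied by 2

middle = arr[1:3]               # slicing works as in lists
big_ones = arr[arr > 2]         # boolean indexing keeps elements matching a condition

print(doubled_list)
print(doubled_arr, middle, big_ones)
```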

In [None]:
np.random.seed(101) # to use a specific seed for generating random numbers

In [None]:
mylist = [3, 1, 20, 2, 2, 3, 5]
nplist = np.array(mylist)
print(mylist)
print(nplist)

In [None]:
mymatrix = [[1,2,3],[4,5,6],[7,8,9]]
npmatrix = np.array(mymatrix)
print(mymatrix)
print(npmatrix)

In [None]:
nprange = np.arange(1, 50, 3) # generating a range using numpy's arange function
print(nprange)

In [None]:
x = np.zeros(4) # generate a 1-D array of 4 zeros
y = np.zeros((3,3)) # generate a 3x3 matrix of zeros
z = np.ones((2,3)) # generate a 2x3 matrix of ones
print(x) ; print(y) ; print(z)

In [None]:
x = np.linspace(0.5, 20, 40) # generate 40 numbers between 0.5 and 20, evenly spaced, starting from 0.5
y = np.eye(4) # generate an NxN identity matrix (here 4x4)
print(x); print(y) 
y.shape # dimension of the object

In [None]:
np.random.rand(3, 2, 3) # generate random numbers from uniform distribution between 0 and 1 placed in the form asked

In [None]:
np.random.randn(3, 2) # generate random numbers from std normal distribution placed in the form asked

In [None]:
np.random.randint(1, 100, 6) # generate 6 random integers from 1 (inclusive) to 100 (exclusive)

In [None]:
x = np.arange(30)
print(x)
x.reshape(5,6) # to reshape the data into a matrix form 

In [None]:
# argmax returns the index of the max and similarly argmin
x = np.random.randint(1, 100, 6)
print(x); print(x.max()); print(x.argmax()); print(x.min()); print(x.argmin())

In [None]:
print(x.dtype) # type of the data stored in the object
print(type(x)) # type of the object itself

### Range
A built-in function that generates a sequence of numbers from a starting number up to (but not including) an ending number, with a specified step.

In [None]:
y = range(10) # range(start, end/exclusive, step) is unique in that it creates a range object! to use it as a list, we have to convert it using the list() function
print(y)
list(y)

In [None]:
z = list(range(2, 20, 2))
print(z)
len(z) # to see the length of a list

In [None]:
names = ["Jenny", "Alexus", "Sam", "Grace"]
heights = [61, 70, 67, 64]
# zip takes two (or more) iterables and returns an iterator of tuples; each tuple holds one element from each input. Printing the zip object itself does not show the pairs, so convert it with list().
names_and_heights = zip(names, heights)
print(names_and_heights)
print(list(names_and_heights))
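
One thing worth knowing about the zip object is that it can be consumed only once. A short sketch using the same two lists (`pairs`, `first_pass`, `second_pass`, and `height_by_name` are names introduced here for illustration):

```python
names = ["Jenny", "Alexus", "Sam", "Grace"]
heights = [61, 70, 67, 64]

pairs = zip(names, heights)
first_pass = list(pairs)     # consumes the iterator
second_pass = list(pairs)    # nothing is left on a second pass

height_by_name = dict(zip(names, heights))  # a fresh zip can also build a dictionary

print(first_pass)
print(second_pass)
print(height_by_name["Sam"])
```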

## Tuple
A tuple is a collection of **ordered, immutable, and heterogeneous** data elements in Python. It's similar to a list, but you can't modify its elements once it's created. Tuples are declared using round brackets ( ) and its elements are separated by commas.

Tuples are commonly used to represent ordered collections of **data that shouldn't change**, for example, coordinates in a map, or dates in a calendar. They're also **faster than lists for some operations and use less memory, making them ideal for data that doesn't need to be modified**.

In [None]:
mytuple = ('Mike', 24, 'Programmer')
mytuple[0] # works the same way as in lists
# mytuple[0] = "Joe" # this will error due to immutability of tuples

In [None]:
print(mytuple[1:]) # all the same like those in list but nothing can change with tuples
name, age, occupation = mytuple # unpacking a tuple by putting variables equal to tuple (no of vars equal to no of elements in tuple - order matters)
x = (4,) # to create a 1-element tuple we need the , in there. Otherwise, it's just a number and not a tuple
x
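
To see the immutability in action without stopping the notebook, the assignment can be wrapped in `try`/`except`. This is just a sketch; `changed` and `newtuple` are names made up for the example.

```python
mytuple = ('Mike', 24, 'Programmer')

try:
    mytuple[0] = "Joe"              # item assignment is not supported by tuples
    changed = True
except TypeError as err:
    changed = False
    print("TypeError:", err)

# to "change" a tuple, build a new one from pieces of the old one
newtuple = ("Joe",) + mytuple[1:]
print(newtuple)
```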

## Set
A set in Python is an unordered collection of unique elements. Sets are commonly used to **remove duplicates from a list** or to perform **mathematical set operations such as union, intersection, and difference**. They are defined using curly braces {} or the built-in **set()** function (an empty set needs set(), since {} creates an empty dictionary).
- lists and tuples can have repetitive values, unlike sets
- sets are unordered: the display order may look sorted for small integers, but it is not guaranteed and does not depend on how you enter the values
- tuples can't change! sets and lists can
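
The set operations mentioned above can be sketched as follows. The sets `left` and `right` are made up for illustration.

```python
left = {1, 2, 3, 4}      # made-up sets
right = {3, 4, 5}

union = left | right             # elements in either set
intersection = left & right      # elements in both sets
difference = left - right        # elements in left but not in right

unique_items = set([2, 2, 3, 5, 3])   # duplicates disappear when a list becomes a set

print(union, intersection, difference)
# the display order of a set is not guaranteed; use sorted() when order matters
print(sorted(unique_items))
```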

In [None]:
fruits = {"apple", "banana", "cherry", "apple", "apple", "cherry"}
fruits

In [None]:
mylist = [1, 20, 2, 2, 3, 5, 3, 2, 3, "hi"]
print(mylist) ; print(type(mylist))
myset = set(mylist)
print(myset) ; print(type(myset))
mytuple = (1, 20, 2, 2, 3, 5, 3, 2, 3, "hi")
print(mytuple) ; print(type(mytuple))

In [None]:
myset.add(1.5) # to add an element to a set (can't be done to a tuple) (for a list, we use the append method)
myset

In [None]:
mytuple.count(2) # how many of these values are in the tuple (count is used in lists too)

In [None]:
mytuple.index(3) # provide the index number for the value

In [None]:
mylist.index(3) # the index of only the first time the value shows up in the list

## Dictionary
A dictionary is a collection of key-value pairs. It is a **mutable data structure indexed by keys** (since Python 3.7 it also preserves insertion order). You can access, add, remove, and update the items in a dictionary using the keys. The keys in a dictionary must be unique and can be of any hashable data type (e.g. string, integer, etc.). The values can be of any data type. The syntax to define a dictionary is using curly braces {} with key-value pairs separated by colons. For example: `my_dict = {'key1': 'value1', 'key2': 'value2'}`

In [None]:
person = {"name": "John Doe", "age": 30, "city": "New York", 'Favorite Numbers':[3, 13, 42]} # create a dictionary
type(person)
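
The access, add, update, and remove operations described above can be sketched like this. A separate dictionary named `contact` (a made-up name and values) is used so that `person` stays untouched for the cells below.

```python
contact = {"name": "John Doe", "age": 30, "city": "New York"}

name = contact["name"]                        # access by key (KeyError if the key is missing)
country = contact.get("country", "unknown")   # .get returns a default instead of raising

contact["age"] = 31                           # update an existing key
contact["job"] = "Analyst"                    # add a new key (hypothetical value)
removed_city = contact.pop("city")            # remove a key and return its value

for key, value in contact.items():            # iterate over key-value pairs
    print(key, "->", value)
```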

# Pandas Dataframe
A dataframe can be built from a dictionary (among other sources) using the pandas library, but it is more than another name for a dictionary. It is a two-dimensional labeled data structure that can store data of different types. It is a commonly used data structure for data analysis and manipulation. Dataframes are created and manipulated using the Pandas library in Python. They are similar to spreadsheets or SQL tables, allowing you to store, manipulate, and analyze data efficiently. Each column in a dataframe is a pandas Series, and each row is identified by an index label. Dataframes have methods and attributes that allow you to perform operations such as selecting rows and columns, filtering, grouping, and aggregating data, and much more.
- `.assign()` to add new columns without modifying the original df (like in a new df, similar to the `mutate` in R's `dplyr`)
- `query()` to filter based on conditions, similar to SQL
- `sort_values()` to sort a DataFrame by one or multiple columns
- `.set_index()` to set an index column
- `.name` to name a Series or an index (e.g. `df.index.name`)

Some additional useful methods that I may come back to, or that may come up in the data analysis learning files:
- `.sample()` to randomly sample rows from df
- `.fillna()` to fill the missing values with a specified value or method
- `.pivot_table()` to do pivot tables out of the df, obviously! Dah!
- `.groupby()` to perform aggregate operations on the groups
- `.transpose()` to transpose df, Dah!
- `.merge()` similar to SQL's joins to combine dfs based on one or more common columns
- `.rename()` to change name of columns
- `.to_csv()` to export to a csv file on the drive

In [None]:
pd.DataFrame(person) # making a dataframe from the dictionary created above using pandas

In [None]:
type(pd.DataFrame(person))

In [None]:
# first takes a matrix, then index = for rownames, columns = for colnames
np.random.seed(101)
df = pd.DataFrame(np.random.randn(5, 4), 
                  index = 'A B C D E'.split(),
                  columns='W X Y Z'.split()
                 )
df

In [None]:
df.index # to extract the index of a dataframe

In [None]:
df['W'] # to get a column

In [None]:
df[['X', 'Y']] # to get multiple columns

In [None]:
df[0:2] # to get rows by index numbers

In [None]:
df.loc['A'] # to get rows by index labels

In [None]:
df.loc[['C', 'D'], ['X', 'Y']] # to get a bunch of rows and columns from dataframe using index labels

In [None]:
df.iloc[0:2] # also to get rows by index number (in this case first two rows) similarly, df.iloc[:2]

In [None]:
df.iloc[:, 0] # to get the first column

In [None]:
df.iloc[:, :3] # to get the first 3 columns

In [None]:
df.iloc[2:4, 1:3] # to get a bunch of rows and columns from dataframe using integer positions

In [None]:
df>0 # to make a condition out of the data in the dataframe

In [None]:
df[df>0]

In [None]:
df[df['W']>0]

In [None]:
df[df['W']>0]['X']

In [None]:
df[df['W']>0].iloc[1]

In [None]:
df[(df['W']>0) & (df['Y'] > 1)]

In [None]:
# drop = True to remove the original index / inplace is used to replace the original df
df2 = df.reset_index(drop=True, inplace=False) 
df2

In [None]:
df2['MyIndex'] = list(string.ascii_uppercase)[10:15] 
df2.set_index('MyIndex', inplace=True) 
df2

In [None]:
stacked = pd.concat([df, df2], axis = 0) # to concatenate two or more dataframes vertically
side_by_side = pd.concat([df, df2], axis = 1) # to concatenate two or more dataframes horizontally

In [None]:
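# this cell and the next few assume df holds the automobile dataset loaded in the Data Wrangling section below
df2 = df.assign(VehicleBMI = df['CurbWeight']/df['Height']**2) # create a new variable from existing variables into a new df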

In [None]:
df.query("CityMPG > 35 and Fuel == 'gas' and Horsepower < 60") # query from data (slower than boolean indexing e.g. below)

In [None]:
df[(df['CityMPG'] > 35) & (df['Fuel'] == 'gas') & (df['Horsepower'] < 60)] # combine the conditions with & into one boolean mask

In [None]:
df.head()

In [None]:
df.index.name = "MyIndex"

In [None]:
df.sort_values(by = 'Length', ascending = False).head()

In [None]:
df.sort_values(by = ['Length', 'Height'], ascending = [True, False])

<div style="line-height:0.5">
<strong> Multi-Index in Pandas </strong>

Creating a *multi-indexed* dataframe using `MultiIndex` in Pandas.
`.from_tuples` builds the index from a list of tuples.
</div>    

In [None]:
outside = ['G1','G1','G1','G2','G2','G2'] 
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside, inside)) # zip creates tuples of pairs for each outside and corresponding inside lists
# print(hier_index)
hier_index = pd.MultiIndex.from_tuples(hier_index) # use the list of zipped tuples and make a MultiIndex object
# print(hier_index)
df = pd.DataFrame(np.random.randn(6, 2), # make a dataframe with the MultiIndex object as the index
                  index = hier_index,
                  columns = ['A','B']
                 )
df

In [None]:
df.loc['G2']

In [None]:
df.loc['G2'].loc[3]['B']

In [None]:
df.index.names

In [None]:
df.index.names = ['Group', 'Number']
df

In [None]:
df.xs(('G1', 3)) # returns a specific row in multiindexed dataframe

In [None]:
df.xs(2, level = 'Number') # returns the row from all indexes where the Number index = 2

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/gishar/Learning_Python/main/Starwars.csv")
df.head(4)

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/gishar/Learning_Python/main/Starwars.csv",
                 sep = ',', 
                 header = 0,        # take the first row as the header
#                  index_col = 0,     # take the first column as the index
#                  skiprows = 5,      # skip the first 5 rows 
                 na_values = 'N/A'  # change "N/A" values to NaN
                ) 
df.head(4)

In [None]:
type(df)

In [None]:
df.columns # to see the name of the variables / columns

In [None]:
df.head(3) # see the top 3 rows of data

In [None]:
df.tail(2)

In [None]:
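# .loc uses labels/column names to call the data from the dataframe
# (the cells from here to the groupby example assume df holds a dataset with these columns, not the Starwars data loaded above)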
df = df.loc[:, ["cones", "ntrees", "dbh", "height", "cover", "sntrees", "sheight", "scover"]] # removed the index column
df.columns # to see the name of the variables / columns

In [None]:
df.loc[:,"ntrees"].head(2)

In [None]:
df.loc[:3, ["cones", "ntrees", "cover", "height"]] # rows with labels up to and including 3 for these columns, in the order written (.loc includes the end label)

In [None]:
df.loc[5:7] # specific rows for all columns

In [None]:
# .iloc uses integer numbers to slice the dataframe
df.iloc[:4] # first 4 rows of the dataframe
df.iloc[1:5, 2:4] # rows at positions 1 to 4 and columns at positions 2 to 3 (the end of each slice is exclusive) - note, index starts from 0
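
The difference between label slicing and position slicing is easy to miss, so here is a small sketch on a made-up frame called `toy`:

```python
import pandas as pd

# toy frame with a default 0-based integer index (made-up values)
toy = pd.DataFrame({"x": [10, 20, 30, 40, 50], "y": list("abcde")})

by_label = toy.loc[1:3]        # label slice: includes the end label 3 -> rows 1, 2, 3
by_position = toy.iloc[1:3]    # position slice: excludes position 3 -> rows 1, 2

print(len(by_label), len(by_position))
print(toy.loc[:3, ["y", "x"]])   # columns come back in the order they are listed
```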

In [None]:
df.cones.unique() # to get the unique values in a column of a dataframe

In [None]:
df['cones'].unique()

In [None]:
# df.groupby(["cones", "ntrees"]).size()  # to group by the combination of two or more variables and show a summarization value (size, mean, etc.)

In [None]:
pd.crosstab(df['hair_color'], df['sex']) # cross tabulation between two variables

In [None]:
df['sex'].value_counts() # find count of unique values for each category 

# Functions

In [None]:
def myfunction():
    """
    This is a description of what this function does. It comes in handy when reviewing it or sending it to other people
    """
    print('Hello')
    print('Oh this is fun')
myfunction()

In [None]:
# an example of creating a function
def biglittle():
    text_with_no_space = input("write some text without spaces:")
    funny = max(text_with_no_space) + " " + min(text_with_no_space)
    return funny

In [None]:
biglittle()

In [None]:
# another example of a function
def greet(lang): # the function receives one input, simply labeled lang here, works with it and prints a greeting
    if lang == "es":
        print("Hola!")
    elif lang == "fr":
        print("Bonjour!")
    else:
        print("Hello!")

greet("fr") # passes "fr" as the argument for the lang parameter

In [None]:
# another example of a function
def greet(lang): # the function receives one input, simply labeled lang here and does work with it and returns something 
    if lang == "es":
        return "Hola!"
    elif lang == "fr":
        return "Bonjour!"
    else:
        return "Hello!"

print(greet("fr"), "Jean-claude")

In [None]:
# another example of a function
def addtwo(x, boo): # a function can receive more than one parameter, label them as they come in, work with them and return
    """
    This function adds the two input numbers and returns their sum
    """
    added = x + boo
    return added # return usually is the last line in a function

try:
    a = float(input("Enter first number: "))
    b = float(input("Enter second number: "))
    print("the sum is equal to: ", addtwo(a,b))
except ValueError: # float() raises ValueError when the input is not a number
    print("Can't do that! Please enter numbers only")

# Indefinite Loop (While)

In [None]:
# While loops are called indefinite loops but mostly we use definite loops
n = 5 # this is the iteration variable - will need to change or otherwise the loop goes forever
while n > 0:
    print(n)
    n -= 1

print("It's Over")

In [None]:
n = 0
while True:
    line = input("Say my name: ")
    if line == "Done":
        break
    print(line)

print("Finally it's over!")

# Definite Loop (For)

In [None]:
n = list(range(1,6)) # just defining a list of numbers to work with in the For loop
for x in n:
    print(x)
    print("Oy!")

In [None]:
x = range(1, 20, 5)
out = []
for item in x:
    out.append(item**2)
print(out)

In [None]:
guys = ['ali', 'hasan', 'haji']
for name in guys:
    print("Happy new year, ", name)

In [None]:
a = 'abc&xyz'  # string
b = 3.16  #float
c = {2:'a', 3:'b'}  # dictionary
d = 6 < 2  # boolean
e = [1,2,3,4,5] # list
f = sum(e)  # value returned from a function
g = sum # a function

for obj in [a,b,c,d,e,f,g]:
    print(obj)

# Data Wrangling / Cleaning
Following [Mona Hatami](https://github.com/monahatami1)'s notebooks on this subject.
Data wrangling is the process of cleaning, transforming, and mapping data from various sources into a format that can be used for analysis and visualization. I will learn how to handle *missing values*, *data formats*, and *data normalization*.
I will need the libraries: `pandas`, `numpy`, `seaborn`, `matplotlib.pyplot`.

Below, will go over the following functions, methods, etc.:
- If a dataset does not have a header row in it, we can use the `names` option of `read_csv` to assign the header names from a defined list
- `.describe()` to get descriptive stat on each numerical variable in the df. This is similar to `summary` in R
- `.info()` to get details on the data type and null values for each column. Similar to `glimpse` in R
- `.unique()` to get unique values of a variable. Great for seeing factor levels
- `.replace(A, B, inplace = True)` to replace missing/strange/unknown values and update the dataset
- `.dropna(axis = 0)` to drop rows with missing values (0 is the default)
- `.dropna(axis = 1)` to drop columns with missing values
- `.isnull().sum()` to see number of null values in the variables (`isna()` does the same thing)
- `.astype('float')` to change the type of the data to another format
- `.idxmax()` to find the index/label of the max value in a Series or the row with the max value in a df column

To handle missing data, we can drop the row, drop the column, or replace the missing value by the mean, median, etc. of the variable.
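
Before applying these options to the real dataset, here is a minimal sketch on made-up data. The frame `raw`, its column names, and its values are hypothetical, and `fillna` is used here as one possible way to do the replacement.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with "?" marking missing values
raw = pd.DataFrame({"bore": ["3.1", "?", "3.5", "3.3"],
                    "doors": ["two", "four", "?", "four"],
                    "price": ["100", "200", "300", "?"]})

clean = raw.replace("?", np.nan)                      # mark "?" as missing
missing_per_column = clean.isnull().sum()             # count missing values per column

option_drop_rows = clean.dropna(axis=0)               # keep only complete rows
option_drop_cols = clean.dropna(axis=1)               # keep only complete columns

bore_mean = clean["bore"].astype("float").mean()      # mean skips missing values
clean["bore"] = clean["bore"].astype("float").fillna(bore_mean)   # numeric: replace by the mean
doors_mode = clean["doors"].value_counts().idxmax()   # most frequent category
clean["doors"] = clean["doors"].fillna(doors_mode)    # categorical: replace by the mode

print(missing_per_column)
print(clean)
```

Dropping is the simplest choice but can throw away a lot of data, as the empty `option_drop_cols` shows; replacing keeps every row at the cost of inventing values.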

In [None]:
headers = ["Symbol","NormalizedLoss","Make","Fuel","Aspiration", "Doors","Body", "DriveWheels","EngineLoc","WheelBase", 
           "Length","Width","Height","CurbWeight","EngineType", "Cylinders", "EngineSize","FuelSystem","Bore","Stroke",
           "CompressionRatio","Horsepower", "PeakRPM","CityMPG","HighwayMPG","Price"]

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/monahatami1/monogram1/master/imports-85.data", 
                 names = headers)
df.head(4)

In [None]:
df.describe(include = "all") # other choices for include or exclude are object for strings, number for numeric values

In [None]:
df.describe(percentiles=[0.1, 0.2, 0.3, 0.4])

In [None]:
df.describe(exclude = 'number')

In [None]:
df.info()

### Dropping Rows/Columns with Missing Values

In [None]:
# see unique values in all variables of the df - This will show us if there are missing data in our dataset.
for i in headers:
    print("Unique values in ", i, " = ", df[i].unique())

In [None]:
df.replace("?", np.nan, inplace = True) # could use na_values = "?" in read_csv too (np.NaN was removed in NumPy 2.0, so np.nan is used)

In [None]:
df.isnull().sum() # to see number of null values in the variables

In [None]:
df.isnull() 

In [None]:
df.isnull().any(axis = 0) # if any column has a missing value it will show as true

In [None]:
df.isnull().any(axis = 1) # if any row has a missing value it will show as true

In [None]:
df[df.isnull().any(axis = 1)] # using the above as condition to get all rows with some missing values

In [None]:
df.dropna(axis = 0) # to drop rows with missing values (returns a new df; reassign or use inplace = True to update the df)

In [None]:
df.dropna(axis=1) # to drop columns with missing values

In [None]:
df.drop(['var1', 'var2'], axis = 1) # to simply drop columns we don't want in the dataset

### Replacing Missing Values of Cells with Mean, Median, etc.

In [None]:
df.tail(4)

In [None]:
NormalizedLossMean = df['NormalizedLoss'].astype('float').mean() # find the mean value of a column
df['NormalizedLoss'] = df['NormalizedLoss'].replace(np.nan, NormalizedLossMean) # replace missing values with mean (assigned back, since inplace on a selected column may not update df)

In [None]:
df.isnull().any(axis = 0)[df.isnull().any(axis = 0)] # show only columns with missing values

In [None]:
# find the average of values in numerical columns with missing values (only columns that makes sense)
StrokeMean = df['Stroke'].astype('float').mean(axis=0)  # axis = 0 is default so it won't matter
BoreMean = df['Bore'].astype('float').mean()
HorsepowerMean = df['Horsepower'].astype('float').mean()
RPMMean = df['PeakRPM'].astype('float').mean()
print("Average Stroke:", round(StrokeMean,3), 
      "\nAverage of Bore:", round(BoreMean, 3),
      "\nAverage Horsepower:", round(HorsepowerMean, 3),
      "\nAverage Peak RPM:", round(RPMMean, 3)
     )

In [None]:
# replace missing values with the mean values on those columns
df['Stroke'] = df['Stroke'].replace(np.nan, StrokeMean) # assigned back, since inplace on a selected column may not update df
df["Bore"] = df["Bore"].replace(np.nan, BoreMean)
df["Horsepower"] = df["Horsepower"].replace(np.nan, HorsepowerMean)
df["PeakRPM"] = df["PeakRPM"].replace(np.nan, RPMMean)

In [None]:
print(df['Doors'].value_counts())
DoorsMode = df['Doors'].value_counts().idxmax() # to see which value has the max count (mode of variable) to replace missings
DoorsMode

In [None]:
df['Doors'] = df['Doors'].replace(np.nan, DoorsMode)

**We can use the `subset` parameter in dropna() to limit the check for missing values to specific columns.** Every time we drop rows, it's best to reset the indices.

In [None]:
df.dropna(subset = ["Price"], axis = 0, inplace = True)

In [None]:
df.head()

In [None]:
df.reset_index(drop = True, inplace = True) # drop true to drop the original index

In [None]:
df.info() # to do a last check on the variables with missing values - also can check their types

### Cleaning Data Formats
This is to convert columns with the object format into int, float, etc. For this purpose, the `.dtype` attribute is used to check the type of a column (for the whole df, use `df.dtypes`) and `.astype()` is used to cast a pandas object to a specified dtype. A type of 'object' usually means the values are stored as strings in Pandas.

In [None]:
df[["Bore", "Stroke","PeakRPM","Price"]] = df[["Bore", "Stroke","PeakRPM","Price"]].astype("float")
df[["NormalizedLoss"]] = df[["NormalizedLoss"]].astype("int")
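
The same idea on a small made-up frame, as a sketch: `toy` and its values are hypothetical, and the exact dtype name shown for text columns can differ between pandas versions.

```python
import pandas as pd

# hypothetical toy data: numbers stored as text
toy = pd.DataFrame({"price": ["13495", "16500", "18920"],
                    "bore": ["3.47", "2.68", "3.19"]})

before = toy.dtypes.astype(str).to_dict()      # dtypes before conversion (text columns)

toy["price"] = toy["price"].astype("int")      # text -> integer
toy["bore"] = toy["bore"].astype("float")      # text -> float

after = toy.dtypes.astype(str).to_dict()       # dtypes after conversion
print(before)
print(after)
print(toy["price"].mean())                     # numeric operations now work
```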

### Data Normalization
ChatGPT: Data normalization is a technique used to transform data into a standardized scale, which allows different features or variables to be compared on an equal footing. The goal of normalization is to bring all the values of a dataset into a common range. This can help to mitigate issues with variable scaling, outliers, and improve the performance of some machine learning algorithms. Additionally, normalization can help to improve the interpretability of data and make it easier to visualize or communicate patterns and trends.

Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.

A specific case of this is ***data standardization***, which rescales the data to have a mean of 0 and a standard deviation of 1 (the shape of the distribution itself does not change).

For example, if the heights of different animals are in the same df under different columns, you could normalize them all to between 0 and 1 to be able to compare them. Or, if the data mixes measurement systems, you may want to convert everything to the metric system.

In [None]:
# example: new column defined as standardized Engine Size (rescaled to mean 0 and stdev 1)
df['StEngineSize'] = (df['EngineSize']-df['EngineSize'].mean())/df['EngineSize'].std()

In [None]:
df.describe(percentiles=[])

In [None]:
df['NormalWidth'] = (df['Width']-df['Width'].min())/(df['Width'].max()-df['Width'].min())

In [None]:
set(list(df['NormalWidth'] > 1)) # shows there is no value that is more than 1

In [None]:
((df['NormalWidth'] < 0) | (df['NormalWidth'] > 1)).sum() # how many values are either less than 0 or more than 1
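
The scalings discussed in this section can be compared side by side on a small made-up Series. The `width` values below are hypothetical. Note that pandas' `.std()` uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical widths, standing in for df['Width']
width = pd.Series([60.3, 64.1, 65.5, 66.2, 72.0])

# Standardization (z-score): mean 0 and sample standard deviation 1
z_score = (width - width.mean()) / width.std()

# Min-max scaling: smallest value maps to 0, largest to 1
min_max = (width - width.min()) / (width.max() - width.min())

# Count values outside [0, 1]; combine the two conditions with | (or)
out_of_range = ((min_max < 0) | (min_max > 1)).sum()

print(z_score.mean(), z_score.std())
print(min_max.min(), min_max.max(), out_of_range)
```

Summing a boolean Series counts its True values, which is why `.sum()` on the combined condition gives the number of out-of-range entries.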

### Binning
Binning is the process of grouping continuous numerical values into discrete categories or 'bins'. It's also called discretization and is useful in visualizations, analyzing patterns or trends, or creating models that require categorical input variables. Pandas' `pd.cut()` function can be used to accomplish this.

In [None]:
df.head()

`astype()` returns a new object and never modifies `df` in place, so the result has to be assigned back to the column. The `copy` parameter defaults to True, which always returns a copy. With copy=False, pandas may skip the copy when the data already has the requested dtype, in which case changes to the result can propagate to the original data.

In [None]:
df['Horsepower'] = df['Horsepower'].astype(int, copy = True)

In [None]:
df.head()
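
A small check of the point above, that `astype()` leaves the original column untouched until the result is assigned back. The `toy` DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: horsepower stored as strings
toy = pd.DataFrame({"Horsepower": ["111", "154", "102"]})

# astype() returns a new Series; toy itself is not changed by this line
converted = toy["Horsepower"].astype(int)
still_strings = not pd.api.types.is_integer_dtype(toy["Horsepower"])

# Only the assignment updates the DataFrame
toy["Horsepower"] = converted
now_integers = pd.api.types.is_integer_dtype(toy["Horsepower"])

print(still_strings, now_integers)
```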

In [None]:
sns.histplot(data = df['Horsepower'],
             bins = 3,
             kde = False,
             color = "red",
             fill = True, # fill is a boolean; the bar color comes from `color`
            )
plt.show()

In [None]:
bins = np.linspace(min(df["Horsepower"]), max(df["Horsepower"]), 4) # divide range of Horsepower into 3 bins using 4 numbers
hpGroups = 'Low Medium High'.split() # define groups to put numbers in
print(bins, hpGroups)

In [None]:
df['HorsepowerBinned'] = pd.cut(df['Horsepower'],
                                bins = bins,
                                right = True, # intervals are closed on the right, so the maximum value lands in the last bin
                                include_lowest = True, # also include the lowest edge, so the minimum value lands in the first bin
                                labels = hpGroups)
df[['Horsepower', 'HorsepowerBinned']].head(20)

In [None]:
df['HorsepowerBinned'].value_counts().plot.bar()

In [None]:
df.tail()
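
To see how the `right` and `include_lowest` options of `pd.cut()` treat values that sit exactly on a bin edge, here is a sketch on made-up horsepower values. The `hp` Series and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values; 48 and 262 sit exactly on the outer edges
hp = pd.Series([48, 100, 120, 200, 262])
edges = np.linspace(hp.min(), hp.max(), 4)  # 4 edges define 3 equal-width bins
labels = ["Low", "Medium", "High"]

# Right-closed bins (a, b]; include_lowest=True also keeps the minimum
closed_right = pd.cut(hp, bins=edges, labels=labels, include_lowest=True)

# Left-closed bins [a, b); the maximum equals the last edge and falls outside
closed_left = pd.cut(hp, bins=edges, labels=labels, right=False)

print(pd.DataFrame({"hp": hp, "closed_right": closed_right, "closed_left": closed_left}))
```

With `right=False` the row holding the maximum gets NaN, which is why right-closed bins with `include_lowest=True` are the safer choice when the edges come from the data's own min and max.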

### One Hot Encoding
One hot encoding is a technique used to transform categorical variables into a format that machine learning algorithms can use. When we have a categorical/factor variable with two or more levels (unique values), we can use Pandas' `get_dummies()` function to create dummy variables out of them. Dummy variables (also called indicator or binary variables) take the values 0 and 1 to label categories. They indicate the presence or absence of some categorical effect, e.g. female and male for gender.
- the `dummy_na = True` option for this function creates an additional column with 1 for missing values and 0 otherwise

In [None]:
Dummy1 = pd.get_dummies(df['Fuel'])
Dummy1.head()

In [None]:
df = pd.concat([df, Dummy1], axis=1)

Let's see which other variables in the dataframe have more than two unique values and apply one hot encoding to one of them.

In [None]:
for i in df.columns:
    print("Unique values in ", i, " = ", df[i].unique())

In [None]:
pd.get_dummies(df['Body']) # let's run it on 'Body' which has 5 levels

In [None]:
df2 = pd.concat([df, pd.get_dummies(df['Body'])], axis = 1)
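
A minimal sketch of `get_dummies()` with the `dummy_na` option mentioned earlier, on a made-up column that contains a missing value. The `toy` data is hypothetical; `prefix` and `dtype=int` are optional arguments used here only to make the new column names and the 0/1 values explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing fuel type
toy = pd.DataFrame({"Fuel": ["gas", "diesel", "gas", np.nan]})

# One column per level, plus an extra indicator column for missing values
dummies = pd.get_dummies(toy["Fuel"], prefix="Fuel", dummy_na=True, dtype=int)

# Attach the indicator columns to the original data
encoded = pd.concat([toy, dummies], axis=1)
print(encoded)
```

Each row has exactly one 1 across the indicator columns, so the row with the missing fuel type is flagged only in the extra column.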

### Quick plots
By default `df.plot()` creates a line plot of all numerical columns in the DataFrame.
Using the `kind` argument we can choose from the available options: **line, bar, barh, hist, box, kde, density, area, pie, scatter, and hexbin**. 

The following options also exist to customize the plot: **title, xlabel, ylabel, legend, grid, xlim, ylim, xticks, yticks etc**. 

We can also use **plt.xlabel(), plt.ylabel(), plt.title() etc.** after the plot to customize it.
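
As a sketch of the `kind` argument and the customization options listed above, the example below draws the same counts as a bar chart and as a pie chart. The `toy` data is made up, and placing both plots in one figure with `plt.subplots` is just one possible layout.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for df['Fuel']
toy = pd.DataFrame({"Fuel": ["gas", "gas", "diesel", "gas"]})
counts = toy["Fuel"].value_counts()

# Two axes side by side so the two kinds can be compared
fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(8, 3))

# Customize through keyword arguments of .plot()
counts.plot(kind="bar", ax=ax_bar, title="Frequencies of each Fuel Type",
            xlabel="Fuel Type", ylabel="Count")

# Same data, different kind
counts.plot(kind="pie", ax=ax_pie, title="Share of each Fuel Type")

fig.tight_layout()
```

In a notebook the figure renders when the cell finishes; in a script, call `plt.show()` at the end.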

In [None]:
df['Fuel'].value_counts().plot(kind = 'bar', 
                               title = "Frequencies of each Fuel Type")
plt.xlabel("Fuel Type")