## How to learn anything 

- Get an overview first by watching resources on YouTube, so you know what the work is going to look like.
- Go through all of it in one go. Don't stop to look anything up; it's fine if you don't understand everything.
- Take your time, then go through the material again slowly, taking notes on the side.
- Give it your best.

### AWS

- [Basics of AWS](https://www.youtube.com/watch?v=gz3dr6o5gxI&list=PLlfy9GnSVerRwvzoRKor9txxfPq9c8FWE)
- [AWS Cloud Learning](https://d1.awsstatic.com/training-and-certification/ramp-up_guides/Ramp-Up_Guide_Machine_Learning.pdf) or use [Local pdf](http://localhost:8888/lab/tree/Projects/Learning%20AWS/Ramp-Up_Guide_Machine_Learning.pdf)
- [AWS ML](https://explore.skillbuilder.aws/learn/lp/28/Machine%2520Learning%2520Learning%2520Plan)

### Web Scraping

[Definitions](https://iqss.github.io/dss-workshops/PythonWebScrape.html)

## Visualization

#### Important Viz codes

- Visualize null values in a dataset
``` python
import matplotlib.pyplot as plt
import seaborn as sns

# viddf is the dataframe being inspected.
sns.heatmap(viddf.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show()
```

- Find the exact name of the rcParams key you want to set
``` python
import matplotlib
for i in matplotlib.rc_params():
    if 'rotation' in i:
        print(i)
```

#### Getting Colors for the plots

``` python
colors = plt.cm.tab10.colors[:10]
#colors = sns.color_palette('Set1', 10)
channel_colors = {}
chdd.sort_values('subscribers',ascending=False,inplace=True)
for i, channel in enumerate(chdd['channelName']):
    channel_colors[channel] = colors[i]
```

#### Matplotlib Pyplot

``` python
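import matplotlib.pyplot as plt

# The numbers below are placeholder values; adjust them to your plot.
fig = plt.figure(dpi=100, figsize=(16, 20)) # Get a figure object
ax1 = fig.add_axes([0.1, 0.1, 0.8, 0.8]) # Create a single axes: [left, bottom, width, height] as figure fractions
ax2 = fig.add_axes([0.6, 0.6, 0.25, 0.25]) # Create more axes in the same figure object using fig.

fig, ax = plt.subplots(nrows=1, ncols=2, dpi=100, figsize=(12, 5)) # Get the figure and axes directly via subplots.
ax[0].plot([1, 2, 3], [1, 4, 9], label='squares') # ax[index].plot() with several axes, ax.plot() with a single one
ax[0].set_xticks([1, 2, 3]) # Use indexing if setting it for a specific axes.
ax[0].set_yticks([1, 4, 9])
ax[0].set_xlim(0, 4)
ax[0].set_ylim(0, 10)
ax[0].set_xlabel('x')
ax[0].set_ylabel('y')
ax[0].legend()
fig.tight_layout() # Adjust spacing so labels do not overlap.
plt.savefig('figure.png') # Example filename
plt.show() # To show the figure. In a notebook, a bare `fig` on the last line also displays it.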
```

#### Seaborn Plots

``` python
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Setting basic configurations for the plots. Uncomment ONE of the two backend options:
# matplotlib.use("TkAgg") # GUI backend for pop-up windows; the notebook default is 'module://matplotlib_inline.backend_inline'
# %matplotlib inline
# (IPython magic to view plots in the nb itself; keep it on its own line, without a trailing comment.)
sns.set(rc={'figure.figsize':(10,8),'figure.dpi':200}) # Accepts any key from matplotlib.rc_params(), sns.plotting_context(), sns.axes_style()

# Doing the plotting
fig, ax = plt.subplots() # With more than 1 axes, use ax[index], or pass ax=ax[index] to the seaborn call.

ax = sns.scatterplot(x=[1, 2, 3], y=[2, 4, 3], ax=ax) # Configuring the plot itself; swap scatterplot for the plot type you need.
ax.set_title('Example plot') # All the axes-level changes go here: tick formatting, legend formatting, etc.


# Showing the plot
plt.show()
```


#### or
``` python
rcparams = {'figure.figsize':(13,9), 'axes.titlesize':12, 'axes.labelsize':9, 'xtick.labelsize':9, 'ytick.labelsize':9, 'legend.fontsize':12,'figure.dpi': 200}
sns.set(rc=rcparams)
#sns.set_palette('husl')

education_level = df['parental_level_of_education'].unique()
color = sns.color_palette('Set1',len(education_level)) # create a palette with one color per education level
hue_colors = {}

for i, level in enumerate(education_level): # create a dictionary with the colors for each category
    hue_colors[level] = color[i]

fig,axs = plt.subplots(1,3,figsize=(25,6))

sns.histplot(data=df,x='average',kde=True,hue='parental_level_of_education',ax = axs[0],palette=hue_colors,legend=False)
axs[0].set_title('Average distribution of Marks for all students')
axs[0].set_xlabel('Average Score')

sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='parental_level_of_education',ax = axs[1],palette=hue_colors,legend=False)
axs[1].set_title('Average distribution of Marks for all male students')
axs[1].set_xlabel('Average Score')

sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='parental_level_of_education',ax = axs[2],palette=hue_colors,legend=False)
axs[2].set_title('Average distribution of Marks for all female students')
axs[2].set_xlabel('Average Score')

fig.legend(loc='upper left', bbox_to_anchor=(0.1,1.15), ncol=2,labels = hue_colors.keys(),title='Parental Education Level')


#fig.tight_layout()
plt.show()
```

- Seaborn
``` python
tips = sns.load_dataset('tips') # Example datasets provided by seaborn (downloaded on first use); 'tips' is one of them.
```
##### Distribution Plots

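- Distribution Plots
    - A distribution plot shows the distribution of a single variable by combining a histogram with a kernel density estimate (KDE), which is a smoothed estimate of the variable's probability density function (pdf).
    - `distplot` has been deprecated since seaborn 0.11; `histplot` (axes-level) and `displot` (figure-level) replace it.

``` python
sns.histplot(data=tips, x='total_bill', kde=True, bins=30)
```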
- Jointplot
    - Jointplot plots bivariate data: the bivariate plot sits in the middle, and the margins show the distribution of each variable.

``` python
sns.jointplot(data=tips, x='total_bill', y='tip', kind='scatter')
# x and y can also be passed as vectors such as tips['total_bill'], in which case data can be omitted.
# kind: scatter, reg, hex, resid, kde, hist
# palette only takes effect together with hue.
```

- Pairplot
    - Pairplot plots pairwise relationships across an entire dataframe (for the numerical columns) and supports a color `hue` argument (for categorical columns).
    - The diagonal shows either a histogram or a KDE of each variable, and the remaining cells are filled with scatterplots.

``` python
sns.pairplot(tips, hue='sex', palette='coolwarm', diag_kind='kde')
```

##### Categorical Data Plots

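- Barplot and countplot
    - A barplot shows an estimate of central tendency (the mean by default) of a numeric variable for each category on x, such as sex.
    - A countplot simply counts the number of occurrences of each category of a single variable.
``` python
sns.barplot(data=tips, x='sex', y='total_bill') # mean total_bill per sex
sns.countplot(data=tips, x='sex') # number of rows per sex (draw it in a separate cell or figure)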
```
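- Boxplot:
    - Shows the five-number summary (minimum, first quartile, median, third quartile, maximum) of a numeric variable, split by category. By default seaborn draws points further than 1.5 × IQR from the box as outliers.
``` python
sns.boxplot(data=tips, x='day', y='total_bill', hue='smoker', palette='coolwarm', orient='v')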
```

##### Matrix Plots
- Matrix plots allow you to plot data as color-encoded matrices and can also be used to indicate clusters within the data (later in the machine learning section we will learn how to formally cluster data).

``` python
sns.heatmap(tips.corr(numeric_only=True), cmap='coolwarm') # Pass a correlation matrix or a pivot table.
```

##### Regression Plots
- Simple regression plots that fit a target variable against a feature variable, optionally faceted by other columns.
``` python
sns.lmplot(x='total_bill',y='tip',data=tips,col='day',row='smoker',hue='sex', palette='coolwarm', aspect=1.5,height=8) # height replaced the older size parameter
```

## Airflow

- Airflow's extensible Python framework enables you to build workflows connecting with virtually any technology. 
- A web interface helps manage the state of your workflows. 
- Airflow is deployable in many ways, varying from a single process on your laptop to a distributed setup to support even the biggest workflows.
- Dynamic: Airflow pipelines are configured as Python code, allowing for dynamic pipeline generation.
- Extensible: The Airflow framework contains operators to connect with numerous technologies. 
- All Airflow components are extensible to easily adjust to your environment.
- Flexible: Workflow parameterization is built-in leveraging the Jinja templating engine.

## Jupyter lab n Notebooks

### Installing jupyter bit n bytes

- Tips 
    - Use conda to manage envs
    - Use pip mostly to install packages
- Selective conda installs
    - nodejs
``` bash
conda install nodejs
conda upgrade -c conda-forge nodejs # Upgrade to the latest version supported by JupyterLab extensions
```

## Pandas & Numpy

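### Reading a df and manipulating it a bit

- Reading a dataframe
```Python
import pandas as pd

df = pd.read_csv("data.csv") # read_csv has no index argument (index=False belongs to to_csv); use index_col to choose an index column.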
```

- Renaming columns
```Python
df = df.rename(columns = {'old_name':'new_name','old_name2':'new_name2','old_name3':'new_name3'})
```

- Reset the index of pandas
```Python
df = df.reset_index()
```

- Setting up column dtypes manually (here, a date column)
```Python
df = pd.read_csv('abc.csv', parse_dates=['Date'])
df.info()

# or
df = pd.read_csv('abc.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.info()
```
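
To check that both routes end up with the same dtype, a small in-memory CSV is enough. This is only a sketch: the CSV text and the names `csv_text`, `df_a` and `df_b` are made up for illustration and stand in for `abc.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'abc.csv'.
csv_text = "Date,Time\n2020-01-05,9.9\n2021-03-10,10.2\n"

# Option 1: parse the column while reading.
df_a = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

# Option 2: read the column as text first, convert afterwards.
df_b = pd.read_csv(io.StringIO(csv_text))
df_b['Date'] = pd.to_datetime(df_b['Date'])

print(df_a.dtypes)
print((df_a['Date'] == df_b['Date']).all())
```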

- Use string methods
``` python
# use
df['Name_Uppercase'] = df['Name'].str.upper()


# instead of
df['Name_Uppercase'] = df['Name'].apply(lambda x: str(x).upper())
df.head()
```

### Analysis

- Check null and Dtypes
``` python
df.info()
```

- Check the number of unique values of each column
``` python
df.nunique()
```

- Check statistics of the data
```Python
df.describe()
```

- Indexing rows and columns
``` python
df.loc['r1_name':'r10_name', 'c1_name':'c5_name']
# loc slices by label and includes the end label.

# or
df.iloc[1:10, 1:5]
# iloc slices by position and excludes the end position.

# reverse the rows
df.loc[::-1].head()

# reverse the columns 
df.loc[:,::-1].head()
```
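
The inclusive/exclusive difference is easiest to see on a tiny frame. A minimal sketch, assuming a made-up dataframe with row labels `r1` to `r4` and columns `a`, `b`, `c`:

```python
import pandas as pd

df = pd.DataFrame(
    {'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12]},
    index=['r1', 'r2', 'r3', 'r4'],
)

by_label = df.loc['r1':'r3', 'a':'b']  # end labels 'r3' and 'b' are included
by_position = df.iloc[0:3, 0:2]        # end positions 3 and 2 are excluded

print(by_label.shape)                  # (3, 2)
print(by_position.shape)               # (3, 2)
print(by_label.equals(by_position))    # True
```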

- Separating numerical and categorical columns
``` python
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']
```
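
An alternative worth knowing is `select_dtypes`, which asks pandas for the numeric columns directly, so it does not depend on text columns being stored as `object`. A minimal sketch with a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Ann', 'Bob'],
    'Year': [1980, 1992],
    'Time': [10.1, 9.9],
})

# 'number' matches every integer and float dtype.
numeric_features = df.select_dtypes(include='number').columns.tolist()
categorical_features = df.select_dtypes(exclude='number').columns.tolist()

print(numeric_features)      # ['Year', 'Time']
print(categorical_features)  # ['Name']
```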

### Conditional filtering and Querying

- Counting
```python
# Find the count by a categorical attribute
df.col_name.value_counts()
```

- Duplicate count
``` python
df.duplicated().sum()
```

- Conditional filtering and Querying
```python
df[( df["Embarked"] == 'S') & (df["Sex"] == 'female') & (df["Survived"] == 1) ].head()
```

- Querying the columns
```Python
min_year = 1980
min_time = 10
df = df.query("Year < @min_year and Time > @min_time")
df
```

- Instead of iterating over the rows of a dataframe, use broadcasting, i.e. a vectorized operation
```Python
df['result'] = df['Year'] > 1980
df
```

- Instead of apply method use the vectorized version if possible.
```Python
# Use
df['year_square'] = df['year'] ** 2
df.head()
# Instead of 
df['year_square'] = df.apply(lambda row: row["year"]**2, axis = 1)
df.head()
```
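
Both versions give the same numbers; the vectorized one just skips the Python-level loop over rows. A small check on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'year': [1975, 1980, 1992]})

vectorized = df['year'] ** 2                                # one operation on the whole column
row_wise = df.apply(lambda row: row['year'] ** 2, axis=1)   # one Python call per row

print((vectorized == row_wise).all())  # True
```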

### NA Handling

- Fill NA values and assign the result back. Avoid `inplace=True`: it prevents method chaining and usually gives no real performance benefit.
```Python
df = df.fillna(0)
```

- Calculate Null Values
``` python
df.isna().sum()
```

- Dropping rows with null values
``` python
df.dropna(axis='index').info() # axis='index' (or 0) drops rows that contain nulls; axis='columns' (or 1) drops columns instead.
```

- Fill with some values 
```python
df["Cabin"] = df["Cabin"].fillna("Default Cabin") # Assign back instead of using inplace=True on a selected column.
```
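
When several columns need different defaults, `fillna` also accepts a dict of column name to fill value and returns a new dataframe. A minimal sketch; the data and the choice of the median for `Age` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Cabin': ['C85', np.nan, 'E46'],
    'Age': [22.0, np.nan, 35.0],
})

print(df.isna().sum())  # one missing value in each column

# Each column gets its own default; df itself is left unchanged.
filled = df.fillna({'Cabin': 'Default Cabin', 'Age': df['Age'].median()})
print(filled)
```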

### Noob mistakes

- Create a copy of the df so that changes don't affect the original, especially when working on a slice of the df
```Python
df_fast = df.query('Time < 10').copy() # A slice of the df
df_fast['country'] = df_fast['Name'].str[-5:]
df_fast
```
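
A quick way to convince yourself that the copy is independent is to check the original afterwards. A sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ann (USA)', 'Bob (JAM)'], 'Time': [9.8, 10.4]})

df_fast = df.query('Time < 10').copy()         # independent copy of the slice
df_fast['country'] = df_fast['Name'].str[-5:]  # only the copy gets the new column

print(df_fast)
print('country' in df.columns)  # False
```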

- Use chaining commands to apply all changes to df at once.
```Python
# Use
df_out = (df.query('Year > 1975')
          .groupby(['Athlete'])[['Time']].min()
          .sort_values('Athlete')
         )
df_out

# Instead of 
df2 = df.query('Year > 1975')
df3 = df2.groupby(['Athlete'])[['Time']].min()
df_out = df3.sort_values('Athlete')
df_out
```
- Use inbuilt pandas plotting functionality
```Python
ax = df.plot(kind = 'scatter',
            x = 'Year',
            y = 'Time',
            title = 'Year vs Speed'
            )
```
- Create functions to process data instead of processing them manually each time.
``` python
def process_data(df):
    df = df.copy() # Work on a copy so the caller's dataframe is left untouched.
    df['Time_Norm'] = df['Time']/df['Time'].mean()
    df['Place'] = df['Place'].str.lower()
    return df

dfm = process_data(dfm)
dfw = process_data(dfw)
dfw
```

- Grouping of data
``` python
df.groupby('Grouping')['Time'].min()

# Compute several aggregations at once
df.groupby('Grouping')['Time'].agg(['mean','count'])
```
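
If you want control over the output column names, named aggregation is one option: each keyword becomes a column. A minimal sketch on made-up data; the names `best`, `mean_time` and `n` are arbitrary:

```python
import pandas as pd

df = pd.DataFrame({
    'Grouping': ['men', 'men', 'women', 'women'],
    'Time': [9.9, 9.7, 10.8, 10.6],
})

# keyword = output column name, value = aggregation to apply
summary = df.groupby('Grouping')['Time'].agg(best='min', mean_time='mean', n='count')
print(summary)
```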

- Saving large data
``` python
# use a binary format (to_parquet and to_feather need an extra engine such as pyarrow)
large_df.to_parquet('op.parquet')
large_df.to_feather('op.feather')
large_df.to_pickle('op.pickle')
# instead of 
large_df.to_csv("abc.csv")
```

- Conditional formatting in pandas with html styling
``` python
df.sort_values( 'Time' ).head(10)[['Name','Time']] \
.reset_index(drop=True) \
.style \
.background_gradient(cmap="Reds")
```

- Giving suffixes while merging 2 df's with validation
``` python
df1 = pd.read_csv( 'mens100m.csv' )
df2  = pd.read_csv( 'womens100m.csv' )
df_merged = df1.merge(df2,on=['Year'],suffixes = ('_mens','_womens'),validate = 'm:1')

# validate options:
# "one_to_one" or "1:1": check if merge keys are unique in both datasets.
# "one_to_many" or "1:m": check if merge keys are unique in left dataset.
# "many_to_one" or "m:1": check if merge keys are unique in right dataset.
# "many_to_many" or "m:m": allowed, but does not result in checks.
```
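
To see what `validate` does when the check fails, here is a small sketch with made-up frames (the times are invented, not real results). A duplicated key on the right side makes an `m:1` merge raise `pandas.errors.MergeError`:

```python
import pandas as pd

left = pd.DataFrame({'Year': [2000, 2004], 'Time': [9.87, 9.85]})
right_ok = pd.DataFrame({'Year': [2000, 2004], 'Time': [10.75, 10.93]})
right_dup = pd.DataFrame({'Year': [2000, 2000], 'Time': [10.75, 10.80]})

# Keys are unique on both sides, so the 1:1 check passes.
merged = left.merge(right_ok, on='Year', suffixes=('_mens', '_womens'), validate='1:1')
print(merged.columns.tolist())  # ['Year', 'Time_mens', 'Time_womens']

# 'Year' is duplicated on the right, so the m:1 check fails.
try:
    left.merge(right_dup, on='Year', validate='m:1')
    check_failed = False
except pd.errors.MergeError:
    check_failed = True
print(check_failed)  # True
```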

- Wrapping the chaining components well, for readability
``` python
df_agg = (
    df
    .groupby(['Grouping','Year'])['Time']
    .min()
    .reset_index()
    .fillna(0)
    .sort_values("Year")
    
)
df_agg
```

- Convert columns that hold categorical data to the `category` dtype to save memory and speed up operations such as grouping
``` python
df['Grouping'] = df['Grouping'].astype('category')
```
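
The memory effect can be checked with `memory_usage(deep=True)`. A rough sketch on a made-up column with only three distinct values; the exact byte counts depend on your pandas version, so none are shown:

```python
import pandas as pd

# 100,000 strings, but only three distinct values.
s = pd.Series(['sprint', 'distance', 'sprint', 'hurdles'] * 25_000)
as_category = s.astype('category')

plain_bytes = s.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(as_category.cat.categories.tolist())  # ['distance', 'hurdles', 'sprint']
print(category_bytes < plain_bytes)         # True
```

The saving comes from storing each distinct value once and keeping small integer codes per row, so it shrinks as the number of distinct values grows.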