# <font color="darkblue">Python Activity</font>

This Jupyter Notebook contains several resources and guided questions for you to answer as you go through the ML Seminar. Before starting, make sure that you have `pip` installed. Then use `pip` on the command line to install the following packages, if you haven't done so already:
- NumPy (`pip install numpy`)
- Pandas (`pip install pandas`)
- Scikit-Learn (`pip install scikit-learn`)
- Keras (`pip install keras`)
- Plotly (option 1 for visualization) (`pip install plotly`)
- Seaborn (option 2 for visualization) (`pip install seaborn`)
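
If you're not sure whether everything installed correctly, a small check like the one below can help. This is only a sketch: the helper name `find_missing` is made up for this notebook, and the list uses the *import* names of the packages, which can differ from the `pip` names (Scikit-Learn is installed as `scikit-learn` but imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the packages listed above. Note that Scikit-Learn is
# installed as `scikit-learn` but imported as `sklearn`.
REQUIRED_MODULES = ["numpy", "pandas", "sklearn", "keras", "plotly", "seaborn"]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

Any module reported as missing can be installed with the matching `pip install` command from the list above.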

In [2]:
# Import the necessary libraries before starting.


# <font color = "lightblue">1. Data Preprocessing and Feature Engineering</font>
### Getting Started with Data Preprocessing: Pandas and NumPy
Raw data is typically not ready to be fed into a machine learning algorithm, because it is often incomplete, messy, noisy, and inconsistent. For example, if you have a `date` column, it's entirely possible that some entries in your dataframe use different formats (e.g. `dd/mm/yyyy`, `mm.dd.yyyy`, `yyyy mm-dd`). As another example, there may be missing data in your table because of unrecorded measurements or sensors not working properly. There may also be statistical outliers that need to be dealt with (e.g. if you're measuring the monthly price of apartments/flats in Heidelberg, an outlier could be one that costs 25000 EUR a month). Whatever issues your dataset has, you need to handle them accordingly, and that is what data preprocessing is.
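
To make the missing-value and outlier cases above concrete, here is a minimal sketch. The rent values are invented for illustration, and median imputation plus the 1.5 × IQR capping rule are just one common choice, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly rents in EUR: one value is missing, one is an outlier.
rents = pd.DataFrame({"rent_eur": [850.0, 920.0, np.nan, 1100.0, 25000.0, 980.0]})

# Impute the missing value with the median, which is less sensitive to the
# outlier than the mean would be.
median_rent = rents["rent_eur"].median()
rents["rent_imputed"] = rents["rent_eur"].fillna(median_rent)

# Cap values that fall more than 1.5 * IQR outside the quartiles
# (a common rule of thumb, assumed here for illustration).
q1 = rents["rent_imputed"].quantile(0.25)
q3 = rents["rent_imputed"].quantile(0.75)
iqr = q3 - q1
rents["rent_capped"] = rents["rent_imputed"].clip(
    lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr
)

print(rents)
```

Whether you delete, impute, or cap depends on your dataset, so treat the thresholds here as a starting point.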

Whereas data preprocessing is about getting your data into a usable format, the idea of _feature engineering_ is to make better use of your existing columns, which may (or may not) help your machine learning algorithm and help you develop better initial insights. For example, say that we have preprocessed the `date` column mentioned above. Feature engineering would mean creating brand new columns such as `is_weekend` or `is_holiday` derived directly from that `date` column.
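
As a rough sketch of the `is_weekend` idea, assuming the `date` column has already been cleaned into a single `yyyy-mm-dd` format (the dates below are made up):

```python
import pandas as pd

# Hypothetical, already-cleaned dates: a Friday through a Monday.
df = pd.DataFrame({"date": ["2024-03-01", "2024-03-02", "2024-03-03", "2024-03-04"]})
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

# dayofweek counts Monday as 0, so Saturday and Sunday are 5 and 6.
df["is_weekend"] = df["date"].dt.dayofweek >= 5

# Other simple date-derived features.
df["month"] = df["date"].dt.month
df["day_name"] = df["date"].dt.day_name()

print(df)
```

An `is_holiday` column would work the same way, but it needs a holiday calendar for your region, which is not included here.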

Typically, when using Python for data science, two of the most important packages are `pandas` and `NumPy`. `pandas` works primarily with dataframes and lets you load, manipulate, and transform your data. It is built on top of `NumPy`, which provides the numerical computations that operate on the data in your dataframe.
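
The short sketch below shows how the two libraries work together. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements.
df = pd.DataFrame(
    {"height_cm": [170.0, 182.0, 165.0], "weight_kg": [65.0, 80.0, 54.0]}
)

# NumPy functions can be applied directly to a pandas column.
df["log_weight"] = np.log(df["weight_kg"])

# Arithmetic on columns is vectorized, so no explicit loop is needed.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Convert selected columns to a plain NumPy array, e.g. to pass to a model.
values = df[["height_cm", "weight_kg"]].to_numpy()
print(type(values), values.shape)
```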

### Resources
- Pandas Documentation: https://pandas.pydata.org/about/citing.html
- In-depth Pandas Tutorials (Text/Code): https://www.geeksforgeeks.org/pandas-tutorial/?ref=lbp
- Essential Pandas (Text/Code): https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
- In-depth Pandas Tutorials (Video/Code): https://www.youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS
- Pandas Cheat Sheet:
![Pandas Cheat Sheet](./assets/pandas_cheatsheet.png)

### Examples
- Best example: https://towardsdatascience.com/30-examples-to-master-pandas-f8a2da751fa4
- Missing values, outliers, numeric/categorical columns, interpolation: https://towardsdatascience.com/data-preprocessing-with-python-pandas-part-1-missing-data-45e76b781993
- Missing values: https://medium.com/@arpitpathak114/data-preprocessing-with-numpy-and-pandas-5598ef69491e


### Questions to Consider:
- Are there any missing values or outliers within your dataset? If so, do you want to delete these rows, impute the values, or cap them at a particular number?
- Are there any inconsistencies within your data?
- Have you one-hot encoded any categorical variables that you may have?
- Are there any seemingly useless columns within your dataset?
- Are there any duplicated rows? Duplicated columns?
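
Most of the questions above can be answered with a few `pandas` calls. The sketch below uses a tiny made-up dataframe; treating a column with a single distinct value as "useless" is an assumption for this example, and your own dataset may call for a different rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0 and 2 are duplicates, `rooms` has a missing
# value, and `constant` carries no information.
df = pd.DataFrame(
    {
        "city": ["Heidelberg", "Mannheim", "Heidelberg", "Heidelberg"],
        "rooms": [2, 3, 2, np.nan],
        "constant": [1, 1, 1, 1],
    }
)

# Missing values per column.
missing_per_column = df.isna().sum()

# Number of rows that repeat an earlier row exactly.
n_duplicate_rows = int(df.duplicated().sum())

# Columns with only one distinct value (assumed "useless" here).
constant_columns = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

# Drop duplicates and constant columns, then one-hot encode the categorical column.
cleaned = df.drop_duplicates().drop(columns=constant_columns)
encoded = pd.get_dummies(cleaned, columns=["city"])

print(missing_per_column)
print(n_duplicate_rows, constant_columns)
print(encoded)
```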

# <font color = "lightblue">2. Exploratory Data Analysis (EDA)</font>

### Getting Started with EDA: Graphing Libraries
In Python, the two most commonly used graphing libraries are `Plotly` and `Seaborn`. The benefit of `Plotly` is that its charts are interactive. `Seaborn`, which is built on top of the graphing library `Matplotlib`, is used by more Machine Learning Engineers, Data Scientists, and Bioinformaticians. Personally, I use `Plotly` more often because of the interactivity, but feel free to play around with both and use whichever you prefer.

Getting Started with `Plotly`: https://www.youtube.com/watch?v=GGL6U0k8WYA&ab_channel=DerekBanas

Getting Started with `Seaborn` and `Matplotlib`: https://www.youtube.com/watch?v=6GUZXDef2U0&ab_channel=DerekBanas

Examples of `Seaborn` and their `Plotly` equivalents: https://analyticsindiamag.com/plotly-vs-seaborn-compari/

### "What's the point of EDA?"
DESCRIPTION OF EDA HERE.

### Final Remarks
FINAL REMARKS OF EDA HERE.

### EDA Examples
- Overview of EDA: https://www.ncbi.nlm.nih.gov/books/NBK557570/
- Pandas EDA: https://datascientyst.com/exploratory-data-analysis-pandas-examples/
- Best example of EDA: https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python

# <font color = "lightblue">3. Modeling</font>
### Getting Started with Modeling: Scikit-Learn and Keras
PLACEHOLDER

### "What's the point of Modeling?"
PLACEHOLDER

### Final Remarks
PLACEHOLDER

### Modeling Resources
- Introduction to Statistical Learning in R: https://hastie.su.domains/ISLR2/ISLRv2_website.pdf
    - Quick Note: Although this book uses the R programming language, it provides <b>excellent</b> (I can't stress this enough) easy-to-understand examples while still going over some of the theory. If you'd like the Python equivalent of these exercises, feel free to look at this: https://github.com/shilpa9a/Introduction_to_statistical_learning_summary_python
    - Most relevant chapters:
        - Chapter 2: Statistical Learning (must read)
        - Chapter 3: Linear Regression
        - Chapter 4: Classification (Logistic Regression)
        - Chapter 5: Resampling Methods
            - Particularly take a look at Chapter 5.1 Cross-Validation
        - Chapter 8: Tree-Based Methods (Decision Trees and Random Forests)
        - Chapter 9: Support Vector Machines
        - Chapter 10: Deep Learning
        - Chapter 11: Survival Analysis
        - Chapter 12: Unsupervised Learning (PCA, K-Means, and Hierarchical Clustering)
- Introduction to Keras (Neural Networks): https://towardsdatascience.com/introduction-to-deep-learning-with-keras-17c09e4f0eb2
- Getting Started with Keras (Neural Networks): https://machinelearningmastery.com/introduction-python-deep-learning-library-keras/
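
To connect the cross-validation chapter mentioned above to Scikit-Learn, here is a minimal sketch. It uses a synthetic dataset from `make_classification` and a logistic regression model purely for illustration. The dataset size, number of folds, and `max_iter` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (not a real dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Logistic regression evaluated with 5-fold cross-validation.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

Each entry in `scores` is the accuracy on one held-out fold, so the spread across folds gives a rough sense of how stable the model is.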

# <font color = "lightblue">4. Results</font>
### "What's the point of showing results?"
PLACEHOLDER

### Final Remarks
PLACEHOLDER

### Results Visualization Resources
- Regression and Classification Metrics: https://machinelearninghd.com/sklearn-metrics-classification-regression/
- Scikit-Learn Documentation: https://scikit-learn.org/stable/modules/model_evaluation.html
- A Data Scientist's Guide to Communicating Results: https://medium.com/comet-ml/a-data-scientists-guide-to-communicating-results-c79a5ef3e9f1
    - Quick note: You don't need to use Comet.ml.
- Communicating Results: https://insidebigdata.com/2018/03/28/data-scientists-guide-communicating-results/
- Great Data Science Report Example: https://www.kaggle.com/code/startupsci/titanic-data-science-solutions
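
As a small taste of the metrics covered in the resources above, the sketch below computes a few of them with Scikit-Learn. The labels and predictions are made up so that the numbers are easy to verify by hand.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error

# Hypothetical classification labels and predictions.
y_true_cls = [0, 1, 1, 0]
y_pred_cls = [0, 1, 0, 0]

# 3 of the 4 predictions are correct.
accuracy = accuracy_score(y_true_cls, y_pred_cls)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_cls, y_pred_cls)

# Hypothetical regression targets and predictions.
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.5, 5.0, 4.0]

# Mean of the absolute errors 0.5, 0.0, and 2.0.
mae = mean_absolute_error(y_true_reg, y_pred_reg)

print("Accuracy:", accuracy)
print("Confusion matrix:\n", cm)
print("MAE:", mae)
```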

# <font color = "lightblue">5. Conclusions</font>
### "What's the point of a conclusions section?"
PLACEHOLDER

### Final Remarks
PLACEHOLDER

# <font color="darkblue">Wrapping Up</font>
PLACEHOLDER.

# Sources
1. 