# Task 2 Overview and Implementation Readme

## What the task was about
The overarching objective of this assignment was to conduct a comprehensive analysis across diverse datasets, covering tasks ranging from dataset characterization to advanced predictive modeling. The tasks were designed to reinforce our understanding of statistical concepts, machine learning algorithms, and exploratory data analysis.

## Understanding the Bigger Picture
The assignment encompassed several sub-tasks, starting with dataset identification and characterization. Tasks included exploratory and inferential analyses, application of various loss functions, visualizations for comparison, and complex modeling scenarios such as kernel transformations, overfitting, and regularization.

## Mathematics and Statistics Involved
The project relied heavily on statistical concepts and mathematical techniques. Exploratory analyses involved descriptive statistics, correlation analyses, and visualizations. Predictive modeling used regression and classification algorithms, evaluated with metrics such as accuracy, $R^2$, and precision, and with confusion matrices. Two models were used throughout my assignment:

- Multiple regression: $y = b_1 x_1 + b_2 x_2 + \dots + b_n x_n + c$
- Linear regression: $Y = \beta_0 + \beta_1 X + \epsilon$
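As a minimal sketch of the multiple regression model, the snippet below fits scikit-learn's `LinearRegression` to synthetic data and reports $R^2$. The data, the true coefficients (2, -3, and 5), and the variable names are assumptions made for illustration; they are not taken from the assignment datasets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (hypothetical): y = 2*x1 - 3*x2 + 5 + small Gaussian noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.1, size=200)
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 5.0 + noise

# Ordinary least squares estimates b1, b2 (coef_) and c (intercept_)
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

print("coefficients b1, b2:", model.coef_)
print("intercept c:", model.intercept_)
print("R^2 on the training data:", r2_score(y, y_pred))
```

With noise this small, the estimated coefficients should land close to the values used to generate the data, which gives a quick sanity check on the fitting code.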

## Implementation Details
Libraries such as NumPy, Pandas, Matplotlib, and scikit-learn were instrumental in the implementation. For kernel transformations, Support Vector Machines (SVM) with Radial Basis Function (RBF) kernels were applied. Algorithms were chosen to match the characteristics of each dataset and task. External resources are cited in the Acknowledgments and References section.
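The following sketch shows why an RBF kernel can help, under the assumption of a toy dataset (`make_circles`) that is not linearly separable. The dataset, split sizes, and variable names are illustrative choices, not the ones used in the assignment.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line separates the classes
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same classifier family, two kernels
linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)
print("linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```

On data like this the RBF kernel should score clearly higher, because it implicitly maps the points into a space where the rings become separable.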

## Outcomes of Each Sub Task
Detailed interpretations and explanations accompanied each sub-task's outcomes. This included insights gained from exploratory analyses, the impact of different loss functions on models, the effectiveness of kernel transformations, and the effects of overfitting and regularization.
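The interplay of overfitting and regularization can be illustrated with a small experiment. This is a sketch under stated assumptions: the sine-shaped synthetic data, the polynomial degree of 12, and the ridge penalty `alpha=1.0` are hypothetical choices, and ridge regression stands in for whichever regularizer was used in the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n noisy samples of sin(pi * x) on [-1, 1] (hypothetical data)."""
    x = rng.uniform(-1.0, 1.0, size=(n, 1))
    y = np.sin(np.pi * x[:, 0]) + rng.normal(scale=0.3, size=n)
    return x, y

X_train, y_train = make_data(30)   # small training set invites overfitting
X_test, y_test = make_data(200)

def fit_and_score(regressor):
    """Fit a degree-12 polynomial model and return (train MSE, test MSE)."""
    model = make_pipeline(
        PolynomialFeatures(degree=12, include_bias=False),
        StandardScaler(),
        regressor,
    )
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    return train_mse, test_mse

ols_train, ols_test = fit_and_score(LinearRegression())
ridge_train, ridge_test = fit_and_score(Ridge(alpha=1.0))

print(f"unregularized: train MSE={ols_train:.3f}, test MSE={ols_test:.3f}")
print(f"ridge:         train MSE={ridge_train:.3f}, test MSE={ridge_test:.3f}")
```

The unregularized fit always achieves the lower training error. A large gap between its training and test error is the usual sign of overfitting, and the ridge penalty typically narrows that gap at the cost of a slightly worse training fit.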

## Challenges and Resolutions
I encountered several challenges that required thoughtful consideration and creative problem-solving to ensure the successful completion of each sub-task.

1. **Understanding the Data:** The datasets I dealt with were like intricate puzzles, each containing diverse pieces of information. Figuring out how everything fit together posed a challenge. I delved deep into the dataset documentation and employed exploratory data analysis techniques to unravel the underlying patterns and relationships within the data.

2. **Picking the Right Tools:** Selecting the most suitable algorithms for regression and classification tasks was a critical decision. This challenge was overcome by conducting a comparative analysis of different algorithms, taking into account factors such as dataset size, linearity, and multicollinearity.

3. **Choosing the Right Features:** Selecting the appropriate features for modeling, especially in scenarios of overfitting, required careful consideration. I addressed this challenge through iterative experimentation with different subsets of features, evaluating their impact on model performance.
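The comparative analysis of algorithms and the experiments with feature subsets described above can be sketched with cross-validation. The synthetic dataset, the two candidate models, and the feature subsets below are assumptions for illustration only; the assignment's actual candidates and features may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical classification data with a few informative and redundant features
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=2, random_state=0
)

# Compare candidate algorithms on the same folds
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()

# Compare feature subsets with a fixed model
feature_subsets = {
    "all_features": list(range(8)),
    "first_four": [0, 1, 2, 3],
}
subset_results = {}
for label, columns in feature_subsets.items():
    scores = cross_val_score(
        LogisticRegression(max_iter=1000), X[:, columns], y, cv=5, scoring="accuracy"
    )
    subset_results[label] = scores.mean()

print("mean accuracy per model:", results)
print("mean accuracy per feature subset:", subset_results)
```

Scoring every candidate with the same cross-validation setup keeps the comparison fair, since differences then come from the model or the features rather than from a lucky train/test split.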

## Acknowledgments and References

- **(i)** Iterate over unique levels of 'UsageGroup': [Pandas Documentation](https://pandas.pydata.org/docs/user_guide/groupby.html)
- **(i)** Encode categorical variables: [Stack Overflow](https://stackoverflow.com/questions/44474570/sklearn-label-encoding-multiple-columns-pandas-dataframe)
- **(ii)** Pruning a Decision Tree: [Stat Infer](https://statinfer.com/204-3-10-pruning-a-decision-tree-in-python/)
- **(i)** Syntax for printing unique values: [Stack Overflow](https://stackoverflow.com/questions/27241253/print-the-unique-values-in-every-column-in-a-pandas-dataframe)
- **(ii)** Syntax for Converting to numeric: [GitHub](https://github.com/pandas-dev/pandas/issues/17007)
- **(iii)** random.normal numpy function: [NumPy Documentation](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html)
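The idioms referenced above can be combined in one short sketch. Only the `UsageGroup` column name comes from this README; the `Reading` column, the sample values, and the derived `UsageGroupCode` column are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame; only the 'UsageGroup' column name appears in the README
df = pd.DataFrame({
    "UsageGroup": ["low", "high", "low", "medium"],
    "Reading": ["1.5", "2.0", "n/a", "3.5"],
})

# Iterate over the unique levels of 'UsageGroup'
group_sizes = {}
for level, group in df.groupby("UsageGroup"):
    group_sizes[level] = len(group)

# Print the unique values in every column
for column in df.columns:
    print(column, df[column].unique())

# Convert to numeric; unparseable entries become NaN
df["Reading"] = pd.to_numeric(df["Reading"], errors="coerce")

# Encode a categorical column as integer codes (classes are sorted alphabetically)
df["UsageGroupCode"] = LabelEncoder().fit_transform(df["UsageGroup"])

# Draw Gaussian noise with numpy.random.normal
np.random.seed(0)
noise = np.random.normal(loc=0.0, scale=1.0, size=len(df))

print(group_sizes)
print(df)
```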

## Further Reading

1. Bruce, P., Bruce, A., and Gedeck, P. *Practical Statistics for Data Science*. [Link](https://ebookcentral.proquest.com/lib/UAL/detail.action?docID=6173908#)
2. Grus, J. (2019). *Data Science from Scratch: First Principles with Python*. O'Reilly Media, Inc.
3. Marsland, S. *Machine Learning: An Algorithmic Perspective*. Chapman & Hall/CRC Machine Learning & Pattern Recognition.
4. Eldén, L. (2007). *Matrix Methods in Data Mining and Pattern Recognition*. Society for Industrial and Applied Mathematics.
5. Boyd, S., and Vandenberghe, L. *Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares*. [Link](https://amzn.eu/d/3knw6E3)
6. Field, A. (2013). *Discovering Statistics Using (IBM SPSS Statistics or R)*. Sage.
7. Stengel, R. *Optimal Control and Estimation*. [Link](https://www.amazon.co.uk/Optimal-Control-Estimation-Dover-Mathematics/dp/0486682005)
8. Thrun, S., Burgard, W., and Fox, D. *Probabilistic Robotics*. [Link](https://amzn.eu/d/gS71bM9)



**Special thanks to Dr. Kayalvizhi Jayavel for her invaluable assistance and guidance throughout this assignment. I am also grateful to Prof. Tim Smith for providing engaging research and data that enriched the learning experience.**