# Data Science Fundamentals
## Predicting Horsepower: A Practical Toy Example with the `automg` Dataset

© 2025, 2026 Yvan Richard.  
*All rights reserved.*

This notebook introduces core concepts in data science and machine learning through structured exploration of a real-world dataset, focusing on data meaning, structure, and limitations rather than model optimisation.

## Table of Contents
1. <a href = "#part1">Motivation and Scope</a>     
2. <a href = "#part2">Data Loading, Description, and Initial Inspection</a>


## <a id = "part1" style = "color: inherit;">1. Motivation and Scope</a>



### 1.1. Data Science and Machine Learning

As outlined in the introduction, the role of a data scientist can be broadly characterised as extracting meaningful insights from complex and often imperfect data. When the objective is to understand or predict relationships between variables, this process typically takes the form of *statistical learning* or *machine learning*. In this setting, the analyst seeks to approximate an unknown function $f$ that maps a set of predictors $X = (X_1, X_2, \ldots, X_p)$ to a response variable $Y$, up to an irreducible random error term $\varepsilon$, assumed to be independent of $X$ and to have zero mean:

$$
Y = f(X) + \varepsilon
$$

This formulation provides a unifying framework for a wide range of predictive and inferential tasks encountered in applied data science.
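
To make the decomposition concrete, the following is a minimal simulation sketch in Python with NumPy. The function `f`, the noise level `sigma`, the sample size `n`, and the range of the single predictor are all assumptions chosen purely for illustration; none of them come from the Auto MPG data. The sketch generates $Y = f(X) + \varepsilon$ with a known $f$ and checks that even the true $f$ cannot predict $Y$ exactly, because the error term remains.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible


def f(x):
    """Hypothetical 'true' relationship, chosen only for illustration."""
    return 2.0 + 0.5 * x


n = 10_000                            # assumed sample size
sigma = 1.5                           # assumed standard deviation of the error
x = rng.uniform(0.0, 10.0, size=n)    # a single predictor X
eps = rng.normal(0.0, sigma, size=n)  # zero-mean error, drawn independently of X
y = f(x) + eps                        # Y = f(X) + epsilon

# Even with the true f, the mean squared error stays close to Var(epsilon).
mse_true_f = np.mean((y - f(x)) ** 2)

print(f"sample mean of eps: {eps.mean():.3f}")
print(f"MSE using true f:   {mse_true_f:.3f}")
print(f"sigma squared:      {sigma ** 2:.3f}")
```

In this simulated setting the mean squared error of the true $f$ should be close to $\sigma^2$, which is what "irreducible" refers to: no choice of predictors or model can remove that part of the error. With real data, $f$ and $\sigma$ are unknown and must be estimated.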

### 1.2. The Early Steps

The objective of this notebook is to illustrate the stages that precede formal modelling in a data science or machine learning workflow. In practice, the majority of time and effort in applied projects is not devoted to model estimation, but rather to data acquisition, cleaning, validation, and exploratory data analysis (EDA). These preliminary steps are critical, as they determine both the feasibility of downstream modelling and the reliability of any conclusions drawn from it.

To ground this discussion in a concrete example, this notebook works with the well-known *Auto MPG* dataset, which records technical characteristics of automobiles and their fuel efficiency. While the dataset is modest in size, it exhibits several features typical of real-world data, including measurement heterogeneity, missing values, and variables whose interpretation requires domain understanding.

The analysis deliberately avoids model optimisation or performance evaluation. Instead, the focus is on developing intuition about the data-generating process, understanding the meaning and limitations of the available variables, and identifying potential challenges that must be addressed before any modelling exercise can be undertaken. In doing so, the notebook aims to provide a simplified but representative illustration of how the early stages of a data mining or machine learning pipeline are typically structured in practice.

## <a id = "part2" style = "color: inherit;">2. Data Loading, Description, and Initial Inspection</a>