fau-masters-collected-works-cgarbin/ieee-icmla-2019-data-science-tutorial

IEEE ICMLA 2019 Data Science Tutorial

Code for the IEEE ICMLA (International Conference on Machine Learning and Applications) 2019 tutorial session "The Data Science landscape: foundations, tools, and practical applications".

Quick "get started" guide:

  1. Clone this repository
  2. cd to the repository's directory
  3. Optional: create a Python virtual environment
    1. python3 -m venv env
    2. source env/bin/activate (Windows: env\Scripts\activate.bat)
    3. python -m pip install --upgrade pip
  4. pip install -r requirements.txt
  5. jupyter lab (or use Visual Studio Code's Jupyter extension)

1. Exploratory data analysis (EDA)

Notebook 1 is about understanding the pieces of information we have in the dataset, confirming that it has no missing values, and checking that each column has, in general, usable values (a few of them may need to be cleaned up; we will deal with that later).
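
As an illustration of this kind of check, here is a minimal pandas sketch. It is not the notebook's code: the DataFrame and its column names (age, education, salary) are made up for the example, and the notebook may use different calls.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tutorial's dataset; the columns are made up.
df = pd.DataFrame({
    "age": [25, 37, 52, 41],
    "education": ["HS", "BS", "MS", None],
    "salary": [32000.0, 54000.0, np.nan, 61000.0],
})

# Column types and non-null counts at a glance.
df.info()

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Summary statistics for numeric columns, to spot implausible values.
print(df.describe())

# Distinct values of a categorical column, to spot entries that need cleanup.
print(df["education"].value_counts(dropna=False))
```

Columns with a nonzero count in missing_per_column are the ones to review before moving on to the analysis.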

We used:

2. Statistics and data science

Notebook 2 describes how to clean up a dataset by removing outliers that are not relevant for the analysis. To do that, we first had to understand the domain of the data we were using (working-age population). We also removed attributes (columns) that were not relevant for the analysis.

Once we had a clean dataset, we collected enough evidence to call for action on possible gender discrimination by using:

  • seaborn histplot() to review the distribution of dataset attributes.
  • Box plots, with seaborn boxplot(), to inspect details of an attribute's distribution: its quartiles and outliers.
  • Pandas DataFrame masks to filter out rows. For example, to remove employees over a certain age, or below an education level.
  • seaborn pairplot() to view the relationship of all attributes of a dataset at a glance.
  • Pandas' cut() to bin (group) attributes into larger categories.
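
The mask and cut() steps above can be sketched as follows. This is an illustration under assumptions, not the notebook's code: the column names, the 18 to 65 age range, and the bin edges are hypothetical.

```python
import pandas as pd

# Hypothetical data; column names and thresholds are illustrative only.
df = pd.DataFrame({
    "age": [17, 25, 38, 52, 64, 71],
    "education_years": [10, 12, 16, 18, 12, 8],
})

# Boolean mask: True for rows inside an assumed working-age range.
working_age = (df["age"] >= 18) & (df["age"] <= 65)

# Keep only the rows where the mask is True.
filtered = df[working_age].copy()

# Bin ages into larger categories. With the default right=True, the bins
# are (17, 30], (30, 50], and (50, 65].
filtered["age_group"] = pd.cut(
    filtered["age"],
    bins=[17, 30, 50, 65],
    labels=["18-30", "31-50", "51-65"],
)
print(filtered)
```

Masks can be combined with & (and), | (or), and ~ (not); each condition needs its own parentheses because of operator precedence.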

3. Using data to answer questions

Notebook 3 uses permutations of a dataset, generated with np.random.permutation(), to test hypotheses.

To prove (or disprove) a hypothesis, we:

  • Inspected the dataset with shape, columns, describe(), and info().
  • Checked for possible duplicated entries with nunique().
  • Performed a domain check on suspiciously low literacy rates, to verify that the data makes sense. We found that the data matches a reliable source.
  • Extracted from the dataset only the pieces of information we needed and transformed some of them (fertility and illiteracy) into a more convenient format, to make the code clearer.
  • Established that there is a correlation visually (with a scatter plot) and formally (with the Pearson correlation coefficient).
  • Performed a large number of experiments, once the correlation was confirmed, to check whether it could have arisen by chance (with np.random.permutation()).
  • Set a seed for the pseudorandom generator (np.random.seed(42)) to make our experiments reproducible.
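
The permutation experiment can be sketched roughly as below. This is a minimal illustration, not the notebook's code: the data is synthetic, and the helper pearson_r, the number of permutations, and the one-sided p-value are assumptions made for the example.

```python
import numpy as np

# Fixed seed so the experiment is reproducible.
np.random.seed(42)

# Synthetic data (made up): fertility loosely increases with illiteracy.
n = 100
illiteracy = np.random.uniform(0, 50, size=n)
fertility = 1.5 + 0.05 * illiteracy + np.random.normal(0, 0.5, size=n)


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(x, y)[0, 1]


observed_r = pearson_r(illiteracy, fertility)

# Shuffling one variable breaks any real pairing between the two, so the
# correlations computed here show what chance alone produces.
n_permutations = 10_000
permuted_r = np.empty(n_permutations)
for i in range(n_permutations):
    shuffled = np.random.permutation(illiteracy)
    permuted_r[i] = pearson_r(shuffled, fertility)

# One-sided p-value: fraction of shuffled datasets with a correlation at
# least as large as the observed one.
p_value = np.mean(permuted_r >= observed_r)
print(f"observed r = {observed_r:.3f}, p-value = {p_value:.4f}")
```

A small p-value means that shuffled data rarely reaches the observed correlation, which is evidence against the correlation being a product of chance.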

4. Machine learning and data science

Notebook 4 uses machine learning to build a model that achieves over 80% accuracy with a few lines of code, without resorting to feature engineering or other transformations.

Along the way we also:

  • Verified that the dataset is imbalanced and adjusted the code accordingly (value_counts()).
  • Used stratified sampling to split the dataset and preserve the class ratios (train_test_split(..., stratify=...)).
  • Used precision and recall to understand where the model makes mistakes (classification_report()).
  • Visualized the mistakes with a confusion matrix (confusion_matrix()).
  • Established a baseline with a simple model.
  • Switched to a more complex model, improving the baseline results.
  • Found an even better model with grid search (GridSearchCV()).
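
The steps above can be sketched end to end with scikit-learn. This is an illustration under assumptions, not the notebook's code: the dataset is synthetic, and the choice of DummyClassifier as the baseline, RandomForestClassifier as the more complex model, and the parameter grid are all hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced dataset (roughly 80% / 20%), made up for the example.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Check the class balance.
print(pd.Series(y).value_counts())

# Stratified split: train and test sets keep the same class ratios.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Assumed baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))

# Assumed more complex model, tuned with a small grid search. Scoring on F1
# rather than accuracy because the classes are imbalanced.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="f1",
    cv=3,
)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)

# Precision and recall per class, then the confusion matrix.
y_pred = grid.predict(X_test)
print(classification_report(y_test, y_pred))
matrix = confusion_matrix(y_test, y_pred)
print(matrix)
```

On an imbalanced dataset the most-frequent-class baseline already scores high on accuracy, which is why precision, recall, and the confusion matrix are more informative than accuracy alone.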

More on this topic

If you found this repository useful, you may also want to check out these repositories: