IEEE ICMLA 2019 Data Science Tutorial

Code for the IEEE ICMLA (International Conference on Machine Learning and Applications) session "The Data Science landscape: foundations, tools, and practical applications".

Quick "get started" guide:

1. Clone this repository
2. cd to the repository's directory
3. Optional: create a Python virtual environment
   1. python3 -m venv env
   2. source env/bin/activate (Windows: env\Scripts\activate.bat)
   3. python -m pip install --upgrade pip
4. pip install -r requirements.txt
5. jupyter lab (or use Visual Studio Code's Jupyter extension)

1. Exploratory data analysis (EDA)

Notebook 1 is about understanding the pieces of information we have in the dataset, and being confident it is not missing values and that each column has, in general, usable values (a few of them may need to be cleaned up - we will deal with that later).

We used:

Pandas to read the data into a well-structured DataFrame.
shape, columns, dtypes, and head() to investigate the basic structure of the dataset.
isnull() to check for missing values.
describe() and unique() to verify that columns are consistent with what we expect them to be.
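
The checks listed above can be sketched on a small, made-up DataFrame. The column names and values below are hypothetical and are not the tutorial's dataset; the notebook reads its data from a file instead of building it inline.

```python
import pandas as pd

# Hypothetical toy data standing in for the tutorial's dataset.
df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51, 38],
        "education": ["HS", "BS", "MS", "BS", "HS"],
        "income": [30000.0, 52000.0, 81000.0, 60000.0, 41000.0],
    }
)

# Basic structure: size, column names, column types, first rows.
print(df.shape)
print(df.columns.tolist())
print(df.dtypes)
print(df.head())

# Missing values per column (all zeros means nothing is missing).
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Consistency checks: numeric ranges and the set of categorical values.
print(df.describe())
education_levels = sorted(df["education"].unique())
print(education_levels)
```

Running describe() on the numeric columns and unique() on the categorical ones is a quick way to spot values outside the expected range or misspelled categories.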

2. Statistics and data science

Notebook 2 describes how to clean up a dataset, removing outliers that are not relevant for the analysis. To do that, we first had to understand the domain of the data we were using (working-age population). We also removed attributes (columns) that were not relevant for the analysis.

Once we had a clean dataset, we collected enough evidence to call for action on possible gender discrimination by using:

seaborn histplot() to review the distribution of dataset attributes.
Box plots, with seaborn boxplot(), to inspect details of an attribute's distribution: its quartiles and outliers.
Pandas DataFrame masks to filter out rows. For example, to remove employees over a certain age, or below an education level.
seaborn pairplot() to view the relationship of all attributes of a dataset at a glance.
Pandas' cut() to bin (group) attributes into larger categories.
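
As a minimal sketch of the mask and binning steps above, assuming a toy DataFrame: the column names, the 18 to 65 age range, and the bin edges are illustrative assumptions, not the values used in the notebook.

```python
import pandas as pd

# Hypothetical toy data; column names and thresholds are assumptions.
df = pd.DataFrame(
    {
        "age": [17, 25, 34, 48, 59, 72],
        "education_years": [10, 12, 16, 18, 12, 8],
    }
)

# Boolean mask: keep only rows inside an assumed working-age range.
working_age = (df["age"] >= 18) & (df["age"] <= 65)
clean = df[working_age].copy()

# Bin ages into larger categories. Intervals are right-inclusive by
# default, so these bins are (17, 30], (30, 45], and (45, 65].
clean["age_group"] = pd.cut(
    clean["age"],
    bins=[17, 30, 45, 65],
    labels=["18-30", "31-45", "46-65"],
)

print(clean)
print(clean["age_group"].value_counts(sort=False))
```

Taking a copy() of the filtered rows avoids pandas' chained-assignment warning when the new column is added.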

3. Using data to answer questions

Notebook 3 uses permutations of a dataset with np.random.permutation() to test hypotheses.

To prove (or disprove) a hypothesis, we:

Inspected the dataset with shape, columns, describe(), and info()
Checked for possible duplicated entries with nunique().
Performed a domain check (on suspiciously low literacy rates) to verify that the data make sense. We found that they match a reliable source.
To make the code clearer, we split out of the dataset only the pieces of information we need and transformed some pieces of data into a more convenient format (fertility and illiteracy).
Established that there is a correlation visually (with a scatter plot) and formally (with the Pearson correlation coefficient).
Once we confirmed that there is a correlation, we performed a large number of experiments to check if the correlation exists by chance (with np.random.permutation()).
To make our experiments reproducible, we set a seed for the pseudorandom generator (np.random.seed(42)).
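
The permutation experiment described above can be sketched as follows. The data here is synthetic and stands in for the notebook's illiteracy and fertility columns; the helper name pearson_r, the sample size, and the number of experiments are assumptions made for illustration.

```python
import numpy as np

# Seed the pseudorandom generator so the run is reproducible.
np.random.seed(42)

# Synthetic stand-in data (not the tutorial's dataset):
# y depends linearly on x, plus noise.
x = np.random.uniform(0, 100, size=150)
y = 0.05 * x + np.random.normal(0, 1, size=150)


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


observed_r = pearson_r(x, y)

# Null hypothesis: x and y are unrelated. Shuffling x breaks any real
# pairing, so the permuted correlations show what chance alone produces.
n_experiments = 10_000
permuted_r = np.empty(n_experiments)
for i in range(n_experiments):
    permuted_r[i] = pearson_r(np.random.permutation(x), y)

# Fraction of shuffled datasets at least as correlated as the real one.
p_value = np.mean(permuted_r >= observed_r)

print(f"observed r = {observed_r:.3f}, p-value = {p_value:.4f}")
```

A small p-value means that a correlation as strong as the observed one rarely appears when the pairing is random, which is evidence against the correlation existing by chance.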

4. Machine learning and data science

Notebook 4 uses machine learning to build a model that achieved over 80% accuracy with a few lines of code and without resorting to feature engineering or other transformations.

Along the way we also:

Verified that the dataset is imbalanced and adjusted the code accordingly (value_counts()).
Used stratified sampling to split the dataset and preserve the class ratios (train_test_split(..., stratify=...)).
Used precision and recall to understand where the model makes mistakes (classification_report()).
Visualized the mistakes with a confusion matrix (confusion_matrix()).
Established a baseline with a simple model.
Switched to a more complex model, improving the baseline results.
Found an even better model with grid search (GridSearchCV()).
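
The workflow above can be sketched with scikit-learn on synthetic data. The source does not name the models used, so DummyClassifier as the baseline, RandomForestClassifier as the more complex model, and the parameter grid are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (roughly 80% / 20%), not the tutorial's dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Check the class balance.
print(pd.Series(y).value_counts(normalize=True))

# Stratified split: both sets keep approximately the same class ratios.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Assumed baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)
print(f"baseline accuracy: {baseline_accuracy:.3f}")

# Grid search over a small, arbitrary grid for an assumed complex model.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
search.fit(X_train, y_train)
y_pred = search.predict(X_test)

# Precision and recall per class, then the confusion matrix
# (rows are true classes, columns are predicted classes).
print(search.best_params_)
print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

On imbalanced data a majority-class baseline already scores high accuracy, which is why precision and recall per class are more informative than accuracy alone.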
