Data science with Pandas and NumPy: EDA, binning, distribution functions, simulations, regression analysis

cap5768-introduction-to-data-science

CAP5768 Introduction to Data Science Fall 2019

Credits:

Quick "get started" guide:

  1. Clone this repository
  2. cd to the repository's directory
  3. Optional: create a Python virtual environment
    1. python3 -m venv env
    2. source env/bin/activate (Windows: env\Scripts\activate.bat)
    3. python -m pip install --upgrade pip
  4. pip install -r requirements.txt
  5. jupyter lab
  6. Navigate to the assignments and open the notebook

If you found this repository useful, you may also want to check out these repositories:

Assignment 1

Assignment 1 is a warm-up exercise to get used to NumPy and Pandas in a Jupyter environment. A short code sketch follows the list below.

Covered in this assignment:

  • Get a Jupyter Lab environment up and running (see this wiki page)
  • Learn and apply NumPy and Pandas (see notes on this wiki page)
  • Read data from .csv files
  • Use Pandas DataFrame and Series filtering and aggregation functions
  • Find correlations analytically with Pearson correlation coefficient and aggregated statistics
  • Find correlations with graphs (Seaborn regplot)
  • Formulate and verify hypotheses with analytical (e.g. aggregation by type) and graphical support (e.g. box plots)
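
A minimal sketch of these steps, assuming a hypothetical data.csv with category, value, and other_value columns (the assignment's actual files and column names differ):

```python
import pandas as pd
import seaborn as sns

# Hypothetical file and column names, for illustration only
df = pd.read_csv("data.csv")

# Filter rows and aggregate with DataFrame/Series functions
subset = df[df["value"] > 0]
summary = subset.groupby("category")["value"].agg(["mean", "median", "std"])

# Pearson correlation coefficient between two numeric columns
r = df["value"].corr(df["other_value"], method="pearson")

# The same relationship, checked visually with a regression plot
sns.regplot(x="value", y="other_value", data=df)

# Box plot to support or refute a hypothesis about the groups
sns.boxplot(x="category", y="value", data=df)
```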

Assignment 2

Assignment 2 is about exploring datasets (manipulating, summarizing, and visualizing them). A short code sketch follows the list below.

Covered in this assignment:

  • Combine multiple datasets into one, using common fields across the datasets
  • Summarize, filter and sort data with pivot_table, groupby, query, eval and other functions
  • Visualize datasets with Matplotlib and DataFrame.plot
  • More formulation and verification of hypotheses with analytical (e.g. aggregation by type) and graphical support
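
A minimal sketch of these operations on two hypothetical toy frames sharing an id field (the assignment's datasets and columns differ):

```python
import pandas as pd

# Hypothetical toy frames that share an "id" field, for illustration only
left = pd.DataFrame({"id": [1, 2, 3], "region": ["A", "B", "A"]})
right = pd.DataFrame({"id": [1, 2, 3], "sales": [10.0, 7.5, 3.2]})

# Combine the datasets on the common field
merged = pd.merge(left, right, on="id")

# Summarize with pivot_table and groupby
by_region = merged.pivot_table(values="sales", index="region", aggfunc="sum")
mean_sales = merged.groupby("region")["sales"].mean()

# Filter and derive new columns with query and eval
large = merged.query("sales > 5")
merged = merged.eval("sales_k = sales / 1000")

# Quick visualization directly from the DataFrame
merged.plot(x="region", y="sales", kind="bar")
```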

Assignment 3

Assignment 3 is where we switched from analytics to statistics, following the first chapters of Think Stats 2. A short code sketch follows the list below.

Covered in this assignment:

  • Selecting the number of bins for histograms ("binning bias")
  • Beyond histograms: swarm plots and box plots
  • Cumulative distribution function (CDF)
  • Correlation with Pearson and Spearman functions
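
A minimal sketch using simulated data (the assignment works with the Think Stats 2 datasets):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Simulated sample, for illustration only
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=500)
other = data + rng.normal(scale=0.5, size=500)

# Binning bias: the same data looks different with different bin counts
fig, axes = plt.subplots(1, 2)
axes[0].hist(data, bins=5)
axes[1].hist(data, bins=50)

# Beyond histograms: a box plot (a swarm plot works the same way with sns.swarmplot)
plt.figure()
sns.boxplot(y=data)

# Empirical CDF: sorted values vs. cumulative fraction of the sample
x = np.sort(data)
y = np.arange(1, len(x) + 1) / len(x)
plt.figure()
plt.plot(x, y, marker=".", linestyle="none")

# Correlation with the Pearson and Spearman functions
pearson_r, _ = stats.pearsonr(data, other)
spearman_r, _ = stats.spearmanr(data, other)
```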

Assignment 4

Assignment 4 explores modeling: how to identify the type of a distribution and its parameters, validate the distribution against a theoretical model, and then use the model to answer questions about the distribution. Part 1 of DataCamp's Statistical Thinking in Python was a helpful resource. A short code sketch follows the list below.

Covered in this assignment:

  • PDF and CDF of exponential distributions
  • Find the type and parameters of an empirical distribution
  • Use the parameters to simulate the distribution and answer questions about it
  • Calculate moments and skewness
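
A minimal sketch with simulated data, assuming the empirical sample looks exponential (the assignment's actual data and parameters differ):

```python
import numpy as np
from scipy import stats

# Simulated "empirical" sample, for illustration only
rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=1000)

# Estimate the parameter: for an exponential, the sample mean estimates the scale (1/lambda)
scale_hat = sample.mean()

# Compare the empirical CDF against the theoretical exponential CDF
x = np.sort(sample)
ecdf = np.arange(1, len(x) + 1) / len(x)
model_cdf = stats.expon.cdf(x, scale=scale_hat)

# Use the fitted parameter to simulate the distribution and answer questions,
# e.g. how likely is a value larger than 5?
simulated = rng.exponential(scale=scale_hat, size=100_000)
p_greater_than_5 = np.mean(simulated > 5)

# Moments and skewness
mean, variance = sample.mean(), sample.var()
skewness = stats.skew(sample)
```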

Assignment 5

Assignment 5 covers hypothesis testing with simulations and p-value calculation. Cassie Kozyrkov's Statistics for people in a hurry was very helpful for understanding what we are attempting here, especially the meaning of the null hypothesis. A short code sketch follows the list below.

Covered in this assignment:

  • Permute the empirical data to run experiments (using numpy.random.permutation())
  • Decide which pieces of the dataset we need to permute (all of them, or only one?)
  • Calculate the p-value from the experiments
  • Interpret the p-value
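
A minimal sketch of a permutation test on two hypothetical groups, using the difference of means as the test statistic (the assignment's dataset and statistic differ):

```python
import numpy as np

# Hypothetical two-group comparison, for illustration only
rng = np.random.default_rng(42)
group_a = rng.normal(loc=1.0, size=100)
group_b = rng.normal(loc=1.2, size=100)

# Test statistic: difference of the group means
observed_diff = group_b.mean() - group_a.mean()

# Null hypothesis: both groups come from the same distribution, so the labels
# are exchangeable. Permute the pooled data and recompute the statistic.
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
replicates = np.empty(10_000)
for i in range(len(replicates)):
    shuffled = np.random.permutation(pooled)
    replicates[i] = shuffled[n_a:].mean() - shuffled[:n_a].mean()

# p-value: fraction of replicates at least as extreme as what we observed
p_value = np.mean(replicates >= observed_diff)
```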

Assignment 6

Assignment 6 covers regression analysis: how to add polynomial features, perform linear regression on the enhanced dataset, evaluate the results with the R2 score, and regularize with Ridge and Lasso. A short code sketch follows the list below.

Covered in this assignment:

  • Perform linear regression with NumPy polyfit()
  • Add features to improve fitting with PolynomialFeatures()
  • Perform linear regression with scikit-learn LinearRegression()
  • Perform all steps together with a pipeline
  • Regularize with Ridge and Lasso to prevent overfitting
  • Use RidgeCV() and LassoCV() for hyperparameter search
  • Evaluate the linear regression results with the R2 score
  • Choose an optimal polynomial degree by comparing R2 scores
  • Do not trust summary statistics alone (Anscombe's quartet)
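
A minimal sketch of these steps on simulated data (the assignment's dataset, polynomial degrees, and alpha grids differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Simulated noisy cubic data, for illustration only
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 200)
y = 0.5 * x**3 - x + rng.normal(scale=2.0, size=x.shape)

# Plain NumPy fit of a degree-3 polynomial
coeffs = np.polyfit(x, y, deg=3)

# The same idea with scikit-learn: add polynomial features, then fit,
# with all the steps combined in a pipeline
X = x.reshape(-1, 1)
linear = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
linear.fit(X, y)
print("R2 (linear):", r2_score(y, linear.predict(X)))

# Regularize a deliberately over-flexible model; RidgeCV/LassoCV search alpha
ridge = make_pipeline(PolynomialFeatures(degree=10), StandardScaler(),
                      RidgeCV(alphas=[0.1, 1.0, 10.0]))
lasso = make_pipeline(PolynomialFeatures(degree=10), StandardScaler(),
                      LassoCV(cv=5))
for name, model in [("ridge", ridge), ("lasso", lasso)]:
    model.fit(X, y)
    print(f"R2 ({name}):", r2_score(y, model.predict(X)))
```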

Final project

In the final project we revisit concepts from early in the course and apply the techniques we learned later. A short code sketch follows the list below.

Covered in the final project:

  • DataFrame describe(), to view summary statistics at a glance. All the important values are available with one function call.
  • How much we get out of the box from the ydata-profiling package. It is like a "mini EDA" with one line of code.
  • The verbose parameter to follow the progress of long-running scikit-learn tasks.
  • Pay attention to the random_state parameter in the scikit-learn APIs to get consistent results.
  • How to use seaborn's heatmap for confusion matrix visualization. More specifically, the trick to zero out the diagonal with NumPy fill_diagonal() to make the classification mistakes stand out in the graph.
  • Use GridSearchCV() to find parameters for a classifier.
  • The power and simplicity of Naive Bayes classifiers, even for seemingly complex tasks such as digit classification. They can be used as a baseline before attempting more complex solutions.
  • How surprisingly well random forest classifiers perform, reaching 97% accuracy on the digit classification task without much work. Another "try this before pulling out the neural network card" case.
  • The small number of components we need to explain variability (the PCA section).
  • Finally getting a chance to play with OpenCV and see first-hand how easy and feature-rich it is.
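
A minimal sketch of some of these pieces, using scikit-learn's bundled digits dataset as a stand-in (the project's actual data, parameters, and results may differ):

```python
import numpy as np
import seaborn as sns
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Fixed random_state for reproducible splits and models
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=42)

# Naive Bayes as a simple baseline
baseline = GaussianNB().fit(X_train, y_train)
print("Naive Bayes accuracy:", accuracy_score(y_test, baseline.predict(X_test)))

# Random forest, with GridSearchCV to pick hyperparameters
# (verbose shows the progress of the long-running search)
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid={"n_estimators": [50, 100, 200]},
                      verbose=1)
search.fit(X_train, y_train)
y_pred = search.predict(X_test)
print("Random forest accuracy:", accuracy_score(y_test, y_pred))

# Confusion matrix heatmap; zero the diagonal so the mistakes stand out
cm = confusion_matrix(y_test, y_pred)
np.fill_diagonal(cm, 0)
sns.heatmap(cm, annot=True, fmt="d")

# How few principal components are needed to explain most of the variance
pca = PCA(n_components=0.95).fit(digits.data)
print("Components for 95% of the variance:", pca.n_components_)
```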
