docmarionum1/python-data-science-primer

Python Data Science Primer

This primer will take you through some of the tools Python has for data science: mathematical operations, statistics, visualization, machine learning, etc.

I will assume you know Python and have some basic familiarity with the topics. I won't be delving into the mathematical details of how the tools work. Instead, I will focus on what they do, why you might use them, and how to use them.

Table of Contents

Usage

All the content is in the form of Jupyter notebooks. You can view it all directly on GitHub without installing anything, but I'd recommend playing along to get the most out of it. If you're completely new to all of the tools being introduced, I'd recommend going in the order outlined in this README because I will be building on the content as I go along.

Installation

You need to have Python 3 installed. If you're on Windows, I highly recommend using Anaconda.

After that, open a command line and run:

pip install -r requirements.txt

Now from the root of this repository run:

jupyter notebook

to launch Jupyter, which will open a browser window where you can navigate through the files of the repo.

NumPy

First up is NumPy. NumPy, short for Numerical Python, is the foundation of pretty much every mathematical Python library. Its primary function is doing array and matrix operations. It does a lot more, but I will be focusing on the essentials.

The contents of the NumPy notebook are:

  • Arrays
  • Matrices
  • Array Creation Functions
  • Generating Random Arrays
  • Reshape
  • Mathematical Operations
  • Statistics
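
As a quick taste of the topics listed above, here is a minimal sketch. The array values and variable names are made up for illustration and are not taken from the notebook.

```python
import numpy as np

# Array creation
a = np.array([1, 2, 3, 4, 5, 6])
zeros = np.zeros((2, 3))          # 2x3 array filled with 0.0
evens = np.arange(0, 10, 2)       # 0, 2, 4, 6, 8

# Reshape the flat array into a 2x3 matrix
m = a.reshape(2, 3)

# Random arrays (seeded so the run is repeatable)
rng = np.random.default_rng(0)
r = rng.random((2, 3))            # uniform samples in [0, 1)

# Mathematical operations
doubled = m * 2                   # element-wise multiplication
product = m @ m.T                 # matrix product, shape (2, 2)

# Statistics
mean = m.mean()                   # mean of all elements
col_sums = m.sum(axis=0)          # sum down each column

print(product)
print(mean, col_sums)
```

Note that `*` is element-wise while `@` is the matrix product; mixing them up is a common early mistake.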

Matplotlib

Matplotlib is the most commonly used Python library for creating 2D plots. Its API is inspired by MATLAB.

The contents of the Matplotlib notebook are:

  • Line Graphs
  • Scatter Plots
  • Combining Plots and Creating Legends
  • Histograms
  • Styling
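
The sketch below touches most of the plot types listed above in one figure. The data is synthetic and the layout choices are my own assumptions, not the notebook's.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)
rng = np.random.default_rng(0)

# Two side-by-side axes in one figure
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Line graph and scatter plot combined on the same axes, with a legend
ax_line.plot(x, np.sin(x), label="sin(x)")
ax_line.scatter(x, np.cos(x), s=10, color="tab:orange", label="cos(x)")
ax_line.set_xlabel("x")
ax_line.legend()

# Histogram of 1000 normally distributed samples
ax_hist.hist(rng.normal(size=1000), bins=30)
ax_hist.set_title("Normal samples")

fig.tight_layout()
```

In a Jupyter notebook the figure renders when the cell finishes; in a plain script you would typically call `plt.show()` or `fig.savefig(...)` afterwards.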

Pandas

Pandas is a library that provides data structures for data analysis. It is similar to having access to an Excel spreadsheet in Python.

The contents of the Pandas notebook are:

  • DataFrames
  • Operations and Filtering
  • Merging DataFrames
  • Grouping Rows by Value
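
To make the list above concrete, here is a small sketch using two made-up tables (`people` and `teams` are hypothetical and not from the notebook).

```python
import pandas as pd

# DataFrames built from plain dictionaries
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "team": ["red", "blue", "red", "blue"],
    "score": [90, 72, 85, 64],
})
teams = pd.DataFrame({
    "team": ["red", "blue"],
    "city": ["Oslo", "Lima"],
})

# Filtering: keep rows where a boolean condition holds
high = people[people["score"] > 80]

# Merging: join the two tables on their shared "team" column
merged = people.merge(teams, on="team")

# Grouping: average score per team
mean_by_team = people.groupby("team")["score"].mean()

print(high)
print(merged)
print(mean_by_team)
```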

StatsModels

StatsModels is a library for running statistical models.

The Regression notebook includes:

  • OLS Linear Regression
  • Using OLS Linear Regression to do Polynomial regression
  • Categorical Variables in OLS Linear Regression

scikit-learn

scikit-learn is a Python library for machine learning.

The classification notebook includes:

  • Naive Bayes
  • K-Nearest Neighbors
  • Support Vector Machines
  • Decision Trees
  • Random Forest
  • Evaluating Model Results
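
The classifiers above all share the same `fit`/`predict` interface, so they are easy to compare. The sketch below assumes the iris dataset that ships with scikit-learn, which may not be the dataset the notebook uses, and the hyperparameters are just illustrative defaults.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Accuracy is only one way to evaluate results; `sklearn.metrics` also provides `confusion_matrix` and `classification_report` for a per-class view.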

The dimensionality reduction notebook includes:

  • Principal Component Analysis (PCA)
  • PCA + Classification
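
As a sketch of both items, the snippet below projects the iris data (again an assumed dataset) onto two principal components and then chains PCA with a classifier in a pipeline. The choice of K-Nearest Neighbors here is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Reduce the 4 original features to 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()  # fraction of variance kept

# PCA + classification: the pipeline refits PCA inside each
# cross-validation fold, so the test fold never leaks into the projection
pipeline = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))
cv_scores = cross_val_score(pipeline, X, y, cv=5)

print(X_2d.shape, explained)
print(cv_scores.mean())
```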

PyBrain

PyBrain is another machine learning library. It has some overlap with scikit-learn, but its major focus is on neural networks.

The PyBrain neural network notebook includes:

  • Function Approximation
  • Classification

Contributing

Contributions are more than welcome - from additional functionality I skipped over to whole new packages I didn't include. Here's a list of things I've already identified that I'd like to add.

License

Code released under the MIT license.
