## 1.1 What Is This Book About?
This book is concerned with the nuts and bolts of manipulating, processing, cleaning,
and crunching data in Python. My goal is to offer a guide to the parts of the Python
programming language and its data-oriented library ecosystem and tools that will
equip you to become an effective data analyst. While “data analysis” is in the title
of the book, the focus is specifically on Python programming, libraries, and tools as
opposed to data analysis methodology.


## 1.2 Why Python for Data Analysis?

Since its first appearance in 1991, Python has become one of the most popular
interpreted programming languages, along with Perl, Ruby, and others. Python and
Ruby have become especially popular since 2005 or so for building websites using
their numerous web frameworks, like Rails (Ruby) and Django (Python). Such
languages are often called *scripting* languages, as they can be used to quickly
write small programs, or *scripts*, to automate other tasks.

For data analysis, interactive computing, and data visualization, Python will
inevitably draw comparisons with other open source and commercial programming
languages and tools in wide use, such as R, MATLAB, SAS, Stata, and others. In recent
years, Python’s improved open source libraries (such as pandas and scikit-learn) have
made it a popular choice for data analysis tasks. Combined with Python’s overall
strength for general-purpose software engineering, it is an excellent option as a
primary language for building data applications.

## 1.3 Essential Python Libraries

### NumPy
**NumPy**, short for Numerical Python, has long been a cornerstone of numerical
computing in Python. It provides the data structures, algorithms, and library glue
needed for most scientific applications involving numerical data in Python.
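
As a small illustration of the array data structure and element-wise computation mentioned above, here is a minimal sketch. The sample values and variable names are made up for demonstration.

```python
import numpy as np

# A small two-dimensional array (values chosen arbitrarily for illustration)
data = np.array([[1.5, -0.1, 3.0],
                 [0.0, -3.0, 6.5]])

# Arithmetic applies element-wise, with no explicit Python loop
scaled = data * 10

# Reductions can run over the whole array or along one axis
column_sums = data.sum(axis=0)

print(data.shape)
print(scaled)
print(column_sums)
```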

### pandas
**pandas** provides high-level data structures and functions designed to make working
with structured or tabular data intuitive and flexible. Since its emergence in 2010, it
has helped enable Python to be a powerful and productive data analysis environment.
The primary objects in pandas that will be used in this book are the DataFrame, a
tabular, column-oriented data structure with both row and column labels, and the
Series, a one-dimensional labeled array object.
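
To make the two objects concrete, the following sketch builds a small Series and a small DataFrame and looks up values by label. The data and the names `population` and `frame` are illustrative assumptions, not examples taken from later chapters.

```python
import pandas as pd

# A Series: one-dimensional values paired with an index of labels
population = pd.Series([1.5, 1.7, 3.6], index=["Ohio", "Texas", "Oregon"])

# A DataFrame: a table whose columns share the same row labels
frame = pd.DataFrame(
    {"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]},
    index=["a", "b", "c"],
)

print(population["Texas"])    # look up a value by its label
print(frame.loc["b", "pop"])  # look up by row label and column label
print(type(frame["year"]))    # each column is itself a Series
```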

### matplotlib
**matplotlib** is the most popular Python library for producing plots and other
two-dimensional data visualizations. It was originally created by John D. Hunter and
is now maintained by a large team of developers. It is designed for creating plots
suitable for publication. While there are other visualization libraries available to
Python programmers, matplotlib is still widely used and integrates reasonably well
with the rest of the ecosystem. I think it is a safe choice as a default visualization tool.
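
A minimal plotting sketch, assuming a script run without a display (hence the non-interactive `Agg` backend) and an in-memory buffer instead of a file on disk, might look like this. In a notebook you would typically skip the backend selection and let the figure render inline.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Sample data: one period of a sine wave
x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()

# Render the figure as PNG into memory; pass a filename instead to write to disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
plt.close(fig)
```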

### IPython and Jupyter
**IPython** is designed for both interactive computing and software development work.
It encourages an
execute-explore workflow instead of the typical edit-compile-run workflow of many
other programming languages. It also provides integrated access to your operating
system’s shell and filesystem; this reduces the need to switch between a terminal
window and a Python session in many cases. Since much of data analysis coding
involves exploration, trial and error, and iteration, IPython can help you get the job
done faster.

### SciPy
SciPy is a collection of packages addressing a number of foundational problems in
scientific computing. Its modules include:
* `scipy.integrate`
* `scipy.linalg`
* `scipy.optimize`
* `scipy.signal`
* `scipy.sparse`
* `scipy.special`
* `scipy.stats`
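
To give a feel for a few of these modules, here is a brief sketch that touches `scipy.integrate`, `scipy.optimize`, and `scipy.stats`. The functions being integrated and minimized are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import integrate, optimize, stats

# scipy.integrate: numerically integrate sin(x) from 0 to pi (exact value is 2)
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# scipy.optimize: locate the minimum of a simple quadratic, which lies at x = 3
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)

# scipy.stats: cumulative probability of the standard normal distribution at 0
prob = stats.norm.cdf(0.0)

print(area, result.x, prob)
```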


### scikit-learn
Since the project’s inception in 2007, scikit-learn has become the premier
general-purpose machine learning toolkit for Python programmers. As of this writing,
more than two thousand different individuals have contributed code to the project.

### statsmodels
statsmodels is a statistical analysis package that was seeded by work from Stanford
University statistics professor Jonathan Taylor, who implemented a number of
regression analysis models popular in the R programming language. Skipper Seabold and
Josef Perktold formally created the new statsmodels project in 2010 and since then
have grown the project to a critical mass of engaged users and contributors. Nathaniel
Smith developed the Patsy project, which provides a formula or model specification
framework for statsmodels inspired by R’s formula system.
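
As a rough sketch of what an R-style formula looks like in practice, the example below fits an ordinary least squares model through the `statsmodels.formula.api` interface. The data is synthetic, generated here with an assumed true intercept of 1 and slope of 2, and the column names `x` and `y` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y = 1 + 2x plus a little random noise (seeded for repeatability)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
df = pd.DataFrame({"x": x, "y": 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)})

# The formula string "y ~ x" means: model y as a function of x, with an intercept
results = smf.ols("y ~ x", data=df).fit()

# Estimated coefficients, labeled "Intercept" and "x"
print(results.params)
```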

### Other Packages


## 1.4 Installation and Setup

## 1.5 Community and Conferences

## 1.6 Navigating This Book

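```python
import numpy as np
import pandas as pd

# Limit how much of a large pandas object is displayed
pd.options.display.max_columns = 20
pd.options.display.max_rows = 20
pd.options.display.max_colwidth = 80

# Print NumPy floats with 4 decimal places and without scientific notation
np.set_printoptions(precision=4, suppress=True)
```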

```python
# Import conventions
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels as sm
```