## Manipulating, processing, cleaning, and crunching data in Python

<span style="color:blue; font-family: monospace">**What Kinds of Data?** </span>   
<span style="font-size:14px; font-family: monospace">*When I say “data,” what am I referring to exactly? The primary focus is on structured
data, a deliberately vague term that encompasses many different common forms of
data, such as:*</span>
- **Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or otherwise). This includes most kinds of data commonly stored in relational databases or tab- or comma-delimited text files.**
- **Multidimensional arrays (matrices).**
- **Multiple tables of data interrelated by key columns (what would be primary or foreign keys for a SQL user).**
- **Evenly or unevenly spaced time series.**

<span style="color:blue; font-size:20px; font-family: serif; font-weight: bold"> Essential Python Libraries</span>

<span style="color:red; font-size:16px; font-family: monospace">**NumPy**</span>:  
<span style="font-size:14px; font-family: monospace">short for Numerical Python, has long been a cornerstone of numerical computing in Python. It provides the data structures, algorithms, and library glue needed for most scientific applications involving numerical data in Python. NumPy contains, among other things:</span>

- ```A fast and efficient multidimensional array object ndarray```  
- ```Functions for performing element-wise computations with arrays or mathematical operations between arrays```  
- ```Tools for reading and writing array-based datasets to disk```  
- ```Linear algebra operations, Fourier transform, and random number generation```  
- ```A mature C API to enable Python extensions and native C or C++ code to access NumPy’s data structures and computational facilities```
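
<span style="font-size:14px; font-family: monospace">A minimal sketch of a few of the features listed above (the ndarray, element-wise operations, linear algebra, and random number generation). It assumes NumPy is installed; the array values and the seed are made up for illustration.</span>

```python
import numpy as np

# A 2 x 3 ndarray; every element shares one dtype (float64 here)
arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

# Element-wise computation, no explicit Python loop
doubled = arr * 2

# Mathematical operation between two arrays of the same shape
total = arr + doubled

# Linear algebra: (2 x 3) @ (3 x 2) matrix product
gram = arr @ arr.T

# Random number generation with a seeded generator for reproducibility
rng = np.random.default_rng(seed=0)
noise = rng.standard_normal(size=arr.shape)

print(arr.shape, arr.dtype)
print(total)
print(gram)
```

<span style="font-size:14px; font-family: monospace">The arithmetic runs in compiled code over the whole array at once, which is the main reason to prefer an ndarray over nested Python lists for numerical work.</span>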




<span style="color:red; font-size:16px; font-family: monospace">**pandas**</span>:  
<span style="font-size:14px; font-family: monospace">pandas provides high-level data structures and functions designed to make working with structured or tabular data fast, easy, and expressive. Since its emergence in 2010, it has helped enable Python to be a powerful and productive data analysis environment. The primary objects in pandas that will be used in this book are the DataFrame, a tabular, column-oriented data structure with both row and column labels, and the Series, a one-dimensional labeled array object.  
pandas blends the high-performance, array-computing ideas of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases (such as SQL).  
It provides sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data. Since data manipulation, preparation, and cleaning are such important skills in data analysis, pandas is one of the primary focuses of this book.  </span>
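
<span style="font-size:14px; font-family: monospace">A minimal sketch of the two primary objects and of the operations mentioned above (selecting subsets, aggregating, reshaping). It assumes pandas is installed; the column names and values are hypothetical.</span>

```python
import pandas as pd

# Series: a one-dimensional labeled array
s = pd.Series([1.5, 2.5, 4.0], index=["a", "b", "c"])

# DataFrame: tabular and column-oriented, with row and column labels;
# each column may hold a different type
df = pd.DataFrame(
    {
        "city": ["Austin", "Boston", "Austin", "Boston"],
        "year": [2020, 2020, 2021, 2021],
        "sales": [10.0, 12.5, 11.0, 14.5],
    }
)

# Select a subset of rows with a boolean condition
recent = df[df["year"] == 2021]

# Aggregate: total sales per city
totals = df.groupby("city")["sales"].sum()

# Reshape: one row per city, one column per year
wide = df.pivot(index="city", columns="year", values="sales")

print(s["b"])
print(recent)
print(totals)
print(wide)
```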




<span style="color:red; font-size:16px; font-family: monospace">**matplotlib**</span>:  
<span style="font-size:14px; font-family: monospace">matplotlib is the most popular Python library for producing plots and other two-dimensional data visualizations. It was originally created by John D. Hunter and is now maintained by a large team of developers. It is designed for creating plots suitable for publication. While there are other visualization libraries available to Python programmers, matplotlib is the most widely used and as such has generally good integration with the rest of the ecosystem. I think it is a safe choice as a default visualization tool.</span>
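
<span style="font-size:14px; font-family: monospace">A minimal sketch of a two-dimensional line plot, assuming matplotlib and NumPy are installed. The non-interactive "Agg" backend and the in-memory buffer are choices made here so the example runs without a display; in a notebook you would normally skip both.</span>

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: no display available, render off-screen
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("Sine and cosine")
ax.legend()

# Render to PNG in memory; pass a filename instead to write to disk
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=150)
plt.close(fig)
```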




<span style="color:red; font-size:16px; font-family: monospace">**SciPy**</span>:  
<span style="font-size:14px; font-family: monospace">SciPy is a collection of packages addressing a number of different standard problem
domains in scientific computing. Here is a sampling of the packages included:</span>  
- ```scipy.integrate```  
- - ```Numerical integration routines and differential equation solvers```
- ```scipy.linalg```  
- - ```Linear algebra routines and matrix decompositions extending beyond those provided in numpy.linalg```
- ```scipy.optimize```  
- - ```Function optimizers (minimizers) and root finding algorithms```
- ```scipy.signal```  
- - ```Signal processing tools```
- ```scipy.sparse```  
- - ```Sparse matrices and sparse linear system solvers```
- ```scipy.special```  
- - ```Wrapper around SPECFUN, a Fortran library implementing many common mathematical functions, such as the gamma function```
- ```scipy.stats```  
- - ```Standard continuous and discrete probability distributions (density functions, samplers, continuous distribution functions), various statistical tests, and more descriptive statistics```
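
<span style="font-size:14px; font-family: monospace">A minimal sketch that touches three of the packages listed above, assuming SciPy and NumPy are installed. The functions being integrated and optimized are toy examples chosen because their exact answers are known.</span>

```python
import numpy as np
from scipy import integrate, optimize, stats

# scipy.integrate: integral of sin(x) from 0 to pi (exact value is 2)
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

# scipy.optimize: minimize (x - 3)^2, whose minimum is at x = 3
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

# scipy.optimize: root of x^2 - 2 on [0, 2], which is sqrt(2)
root = optimize.brentq(lambda x: x ** 2 - 2.0, 0.0, 2.0)

# scipy.stats: standard normal cumulative distribution function at 0
p = stats.norm.cdf(0.0)

print(area, result.x, root, p)
```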




<span style="color:red; font-size:16px; font-family: monospace">**scikit-learn**</span>:  
<span style="font-size:14px; font-family: monospace">Since the project’s inception in 2010, scikit-learn has become the premier general-purpose machine learning toolkit for Python programmers. In just seven years, it has had over 1,500 contributors from around the world. It includes submodules for such models as:</span>  
- ```Classification: SVM, nearest neighbors, random forest, logistic regression, etc.```  
- ```Regression: Lasso, ridge regression, etc.```  
- ```Clustering: k-means, spectral clustering, etc.```  
- ```Dimensionality reduction: PCA, feature selection, matrix factorization, etc.```  
- ```Model selection: Grid search, cross-validation, metrics```  
- ```Preprocessing: Feature extraction, normalization```  
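
<span style="font-size:14px; font-family: monospace">A minimal sketch combining three of the submodule areas above (preprocessing, classification, and model selection), assuming scikit-learn is installed. The iris dataset bundled with scikit-learn and the choice of logistic regression are illustrative only.</span>

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Preprocessing (normalization) followed by a classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model selection: 5-fold cross-validated accuracy
scores = cross_val_score(model, X, y, cv=5)

print(scores)
print(scores.mean())
```

<span style="font-size:14px; font-family: monospace">Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the held-out fold leaks into preprocessing.</span>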




<span style="color:red; font-size:16px; font-family: monospace">**statsmodels**</span>:  
<span style="font-size:14px; font-family: monospace">statsmodels is a statistical analysis package that was seeded by work from Stanford University statistics professor Jonathan Taylor, who implemented a number of regression analysis models popular in the R programming language. Skipper Seabold and Josef Perktold formally created the new statsmodels project in 2010 and since then have grown the project to a critical mass of engaged users and contributors. Nathaniel Smith developed the Patsy project, which provides a formula or model specification framework for statsmodels inspired by R’s formula system.</span>  

<span style="font-size:14px; font-family: monospace">Compared with scikit-learn, statsmodels contains algorithms for classical (primarily frequentist) statistics and econometrics. This includes such submodules as:</span>  
- ```Regression models: Linear regression, generalized linear models, robust linear models, linear mixed effects models, etc.```  
- ```Analysis of variance (ANOVA)```  
- ```Time series analysis: AR, ARMA, ARIMA, VAR, and other models```  
- ```Nonparametric methods: Kernel density estimation, kernel regression```  
- ```Visualization of statistical model results```  

<span style="font-size:14px; font-family: monospace">statsmodels is more focused on statistical inference, providing uncertainty estimates and p-values for parameters. scikit-learn, by contrast, is more prediction-focused.</span> 
