# Data Science Day 10

## 10 Simple Hacks to Speed Up Your Data Analysis in Python

1. Profiling the pandas dataframe
    - Profiling is a process that helps us understand our data, and Pandas Profiling is a Python package that does exactly that
    - It is a simple and fast way to perform exploratory data analysis (EDA) of a Pandas DataFrame
    - The pandas df.describe() and df.info() functions are normally used as a first step in the EDA process
    - However, they give only a very basic overview of the data and don't help much with large data sets
    - Pandas Profiling, on the other hand, extends the pandas DataFrame with df.profile_report() for quick data analysis
    - It displays a lot of information with a single line of code, presented as an interactive HTML report

2. Bringing interactivity to pandas plots
    - Pandas has a built-in .plot() function as part of the dataframe class
    - However, the visualizations rendered with this function aren't interactive, which makes them less appealing
    - On the other hand, the ease of plotting charts with the pandas.DataFrame.plot() function cannot be ignored


3. A dash of magic
    - Magic commands are a set of convenient functions in Jupyter Notebooks
    - Magic commands are of two kinds:
        - Line magics, which are prefixed by a single % character and operate on a single line of input
        - Cell magics, which are associated with the double %% prefix and operate on multiple lines of input
    - Line magics are callable without having to type the initial % if automagic is set to 1

4. Finding and eliminating errors
    - The interactive debugger is also a magic function
    - If you get an exception while running a code cell, type %debug in a new cell and run it
    - This opens an interactive debugging environment which brings you to the position where the exception has occurred
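
Returning to the first hack: before reaching for a full profiling report, it helps to see what the baseline df.describe() and df.info() calls actually return. This is a minimal sketch on a made-up DataFrame; the column names `price` and `city` and their values are hypothetical and only for illustration.

```python
import io

import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "price": [10.0, 12.5, None, 9.0],
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
})

# describe() summarizes the numeric columns (count, mean, std, quartiles)
summary = df.describe()
print(summary)

# info() prints dtypes and non-null counts; capture the text in a buffer
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

Note that describe() skips the missing price when counting, and the text column only shows up in info().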

## Data Manipulation with Pandas

- Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame
    - DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data
- As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs

- Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time

### Installing and Using Pandas

- Installing Pandas on your system requires NumPy to be installed; if you build the library from source, you also need the appropriate tools to compile the C and Cython sources on which Pandas is built

In [2]:
import pandas
pandas.__version__

'1.0.5'

- Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd:

In [3]:
import pandas as pd

In [8]:
pd?

In [9]:
conda update pandas

Collecting package metadata (current_repodata.json): done
Solving environment: / 

Updating pandas is constricted by 

anaconda -> requires pandas==1.0.5=py38h959d312_0

If you are sure you want an update of your package either try `conda update --all` or install a specific version of the package you want using `conda install <pkg>=<version>`

done

## Package Plan ##

  environment location: /Users/ErinMcConnell/opt/anaconda3

  added / updated specs:
    - pandas


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.8.5                |           py38_0         2.8 MB
    ------------------------------------------------------------
                                           Total:         2.8 MB

The following packages will be UPDATED:

  conda                                        4.8.3-py38_0 --> 4.8.5-py38_0



Downloading and Extracting Packages
conda-4.8.5          | 2.8 MB  

## Introducing Pandas Objects

- At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices
- Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are

In [10]:
import numpy as np

### The Pandas Series Object

- A Pandas Series is a one-dimensional array of indexed data
- It can be created from a list or array as follows:

In [11]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

- As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes
- The values are simply a familiar NumPy array:

In [12]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

- The index is an array-like object of type pd.Index

In [13]:
data.index

RangeIndex(start=0, stop=4, step=1)

- Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [14]:
data[1]

0.5

In [15]:
data[1:3]

1    0.50
2    0.75
dtype: float64

- The Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates

#### Series as Generalized NumPy Array

- It may look like the Series object is basically interchangeable with a one-dimensional NumPy array
- The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values
- This explicit index definition gives the Series object additional capabilities
- For example, the index need not be an integer, but can consist of values of any desired type
- For example, we can use strings as an index:

In [16]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

- And the item access works as expected:

In [17]:
data['b']

0.5

- We can even use non-contiguous or non-sequential indices:

In [18]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [19]:
data[5]

0.5
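
To make the implicit-versus-explicit index distinction concrete, here is a small sketch that wraps the same NumPy values in two Series, one with the default integer index and one with string labels. The variable names `implicit` and `explicit` are just illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([0.25, 0.5, 0.75, 1.0])

# Implicit index: positions 0..3, just like the NumPy array
implicit = pd.Series(values)

# Explicit index: the same values, addressed by arbitrary labels
explicit = pd.Series(values, index=["a", "b", "c", "d"])

print(implicit.index)        # a RangeIndex
print(explicit.index)        # an Index of string labels
print(explicit["c"])         # lookup by label
print(explicit.to_numpy())   # the underlying NumPy values
```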

#### Series as Specialized Dictionary

- You can think of a Pandas Series a bit like a specialization of a Python dictionary
- A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values
- This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations
- The Series-as-dictionary analogy can be made even more clear by constructing a Series object directly from a Python dictionary:

In [20]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

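- By default, a Series will be created where the index is drawn from the dictionary keys in the order they were inserted, as the output above shows (older pandas versions sorted the keys)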
- From here, typical dictionary-style item access can be performed:

In [21]:
population['California']

38332521

- Unlike a dictionary, though, the Series also supports array-style operations such as slicing:

In [22]:
population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
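
The slice above happens to span the whole Series, so it does not show much. A narrower slice, sketched here with the same population figures, makes it clearer that label-based slicing includes both endpoints.

```python
import pandas as pd

population = pd.Series({
    "California": 38332521,
    "Texas": 26448193,
    "New York": 19651127,
    "Florida": 19552860,
    "Illinois": 12882135,
})

# Dictionary-style lookup by key
texas = population["Texas"]

# Array-style slicing by label; both 'Texas' and 'Florida' are included
middle = population["Texas":"Florida"]
print(middle)
```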

#### Constructing Series Objects

- Constructing a Pandas Series can be done by using some version of the following:
    pd.Series(data, index=index)
- where index is an optional argument, and data can be one of many entities
- For example, data can be a list or NumPy array, in which case index defaults to an integer sequence:

In [23]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

- data can be a scalar, which is repeated to fill the specified index:

In [24]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

- data can be a dictionary, in which case index defaults to the dictionary keys in insertion order:

In [26]:
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

- In each case, the index can be explicitly set if a different result is preferred:

In [27]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object

- Notice that in this case, the Series is populated only with the explicitly identified keys
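
A short sketch of the dictionary case, showing the key order, an explicit sort, and what happens when the requested index contains a label that is not in the dictionary. The label 4 is deliberately absent, and the variable names are illustrative.

```python
import pandas as pd

letters = {2: "a", 1: "b", 3: "c"}

# Keys keep their insertion order in recent pandas versions
as_given = pd.Series(letters)

# sort_index() gives a sorted index when that is what you want
sorted_keys = as_given.sort_index()

# An explicit index selects and reorders keys; a label missing from the
# dictionary (4 here) becomes NaN
selected = pd.Series(letters, index=[3, 2, 4])
print(selected)
```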

### The Pandas DataFrame Object

- The DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary

In [32]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [33]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |


In [34]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [35]:
states.columns

Index(['population', 'area'], dtype='object')

#### DataFrame as Specialized Dictionary

- Similarly, we can also think of a DataFrame as a specialization of a dictionary
- Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data
- For example, asking for the 'area' attribute returns the Series object containing the areas we saw earlier:

In [37]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

- Notice the potential point of confusion here: in a two-dimensional NumPy array, data[0] will return the first row
- For a DataFrame, data['col0'] will return the first column
- Because of this, it is probably better to think about DataFrames as generalized dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful
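
The row-versus-column difference can be checked directly. This sketch uses a small made-up array with hypothetical column names `col0` and `col1`; iloc, used for the last line, is the positional indexer on a DataFrame.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=["col0", "col1"])

first_row = arr[0]        # NumPy: a single index selects a row
first_col = df["col0"]    # DataFrame: a single key selects a column
same_row = df.iloc[0]     # positional row access on the DataFrame

print(first_row, first_col.tolist(), same_row.tolist())
```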

#### Constructing DataFrame Objects

- A Pandas DataFrame can be constructed in a variety of ways:

##### From a single Series object

- A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:

In [38]:
pd.DataFrame(population, columns=['population'])

| | population |
|---|---|
| California | 38332521 |
| Texas | 26448193 |
| New York | 19651127 |
| Florida | 19552860 |
| Illinois | 12882135 |


##### From a list of dicts

- Any list of dictionaries can be made into a DataFrame

In [40]:
data = [{'a':i, 'b':2*i} for i in range(3)]
pd.DataFrame(data)

| | a | b |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | 4 |


- Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., "not a number") values:

In [41]:
pd.DataFrame([{'a':1, 'b':2}, {'b':3, 'c':4}])

| | a | b | c |
|---|---|---|---|
| 0 | 1.0 | 2 | NaN |
| 1 | NaN | 3 | 4.0 |
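
The decimal points in columns a and c above are a side effect of the NaN fill. A quick sketch of how to confirm this by inspecting the dtypes and counting the missing cells:

```python
import pandas as pd

records = [{"a": 1, "b": 2}, {"b": 3, "c": 4}]
df = pd.DataFrame(records)

# Columns with a missing entry are upcast to float so they can hold NaN;
# column b is complete and stays integer
print(df.dtypes)

# Count the missing cells per column
missing = df.isna().sum()
print(missing)
```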


##### From a dictionary of Series objects

- A DataFrame can be constructed from a dictionary of Series objects as well:

In [42]:
pd.DataFrame({'population':population, 'area':area})

| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |


##### From a two-dimensional NumPy array

- Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names
- If omitted, an integer index will be used for each:

In [43]:
pd.DataFrame(np.random.rand(3, 2),
            columns=['foo', 'bar'],
            index=['a', 'b', 'c'])

| | foo | bar |
|---|---|---|
| a | 0.978874 | 0.326679 |
| b | 0.287176 | 0.25087 |
| c | 0.019664 | 0.63412 |


##### From a NumPy structured array

- A Pandas DataFrame operates much like a structured array, and can be created directly from one:

In [44]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [45]:
pd.DataFrame(A)

| | A | B |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 0 | 0.0 |
| 2 | 0 | 0.0 |
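
An all-zero array hides how the fields map to columns. The sketch below uses a structured array with made-up values, so the field names and dtypes can be seen carrying over.

```python
import numpy as np
import pandas as pd

# Field names "A" and "B" become the column names; values are made up
records = np.array([(1, 0.5), (2, 1.5), (3, 2.5)],
                   dtype=[("A", "i8"), ("B", "f8")])
df = pd.DataFrame(records)

print(df)
print(df.dtypes)   # each field keeps its own dtype
```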


### The Pandas Index Object

- Both the Series and DataFrame objects contain an explicit index that lets you reference and modify data
- This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values)
- Those views have some interesting consequences in the operations available on Index objects

In [46]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

#### Index as Immutable Array

- The Index in many ways operates like an array
- For example, we can use standard Python indexing notation to retrieve values or slices:

In [47]:
ind[1]

3

In [48]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

- Index objects also have many of the attributes familiar from NumPy arrays:

In [49]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


- One difference between Index objects and NumPy arrays is that indices are immutable - that is, they cannot be modified via the normal means:

In [50]:
ind[1] = 0

TypeError: Index does not support mutable operations

- The immutability makes it safer to share indices between multiple DataFrames and arrays, without the potential for side effects from inadvertent index modification
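
A sketch of what immutability means in practice: in-place assignment raises TypeError, and the usual workaround is to build a new Index. The delete/insert combination here is just one possible way to do that.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0
    modified = True
except TypeError as err:
    modified = False
    print("TypeError:", err)

# To "change" an index, build a new one; the original is left untouched
new_ind = ind.delete(1).insert(1, 0)
print(new_ind)
```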

#### Index as Ordered Set

- Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic
- The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [51]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [52]:
indA & indB #intersection

Int64Index([3, 5, 7], dtype='int64')

In [53]:
indA | indB #union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [54]:
indA ^ indB #symmetric difference

Int64Index([1, 2, 9, 11], dtype='int64')

- These operations may also be accessed via object methods, for example indA.intersection(indB)
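
The method forms can be sketched as follows. They are likely the safer spelling across pandas versions, since the operator forms for set operations were deprecated in later releases.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)             # values in both
either = indA.union(indB)                    # values in at least one
only_one = indA.symmetric_difference(indB)   # values in exactly one
only_a = indA.difference(indB)               # values in indA but not indB

print(common, either, only_one, only_a, sep="\n")
```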

## Data Indexing and Selection

### Data Selection in Series

- A Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary

#### Series as Dictionary

- Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:

In [55]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [56]:
data['b']

0.5

- We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [57]:
'a' in data

True

In [58]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [59]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

- Series objects can even be modified with a dictionary-like syntax
- Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:

In [60]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

- This easy mutability of the objects is a convenient feature: under the hood, Pandas makes decisions about memory layout and data copying that might need to take place, so the user generally does not need to worry about these issues

#### Series as One-Dimensional Array

- A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays - that is, slices, masking, and fancy indexing

In [61]:
#slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [63]:
#slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [64]:
#masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [65]:
#fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

- Among these, slicing may be the source of the most confusion
- Notice that when slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice
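
The inclusive-versus-exclusive rule is easy to verify. This sketch spells the two slices with the explicit loc (label) and iloc (position) indexers so that the intent of each slice is unambiguous.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=["a", "b", "c", "d", "e"])

by_label = data.loc["a":"c"]    # label slice: 'c' is included
by_position = data.iloc[0:2]    # positional slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```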

#### Indexers: loc, iloc, and ix

- These slicing and indexing conventions can be a source of confusion
- For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index

In [66]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [67]:
#explicit index when indexing
data[1]

'a'

In [68]:
#implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

- Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes
- These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series
- First, the loc attribute allows indexing and slicing that always references the explicit index:

In [69]:
data.loc[1]

'a'

In [70]:
data.loc[1:3]

1    a
3    b
dtype: object

- The iloc attribute allows indexing and slicing that always references the implicit Python-style index:

In [71]:
data.iloc[1]

'b'

In [72]:
data.iloc[1:3]

3    b
5    c
dtype: object

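- A third indexing attribute, ix, was a hybrid of the two and, for Series objects, was equivalent to standard []-based indexing; it was deprecated in pandas 0.20 and removed in pandas 1.0, so it is not available in the version used here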
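
One more way to see that loc and iloc consult different things: on the integer-indexed Series from above, 2 is a valid position but not a valid label. A minimal sketch:

```python
import pandas as pd

data = pd.Series(["a", "b", "c"], index=[1, 3, 5])

# Position 2 exists, so iloc returns the third element
third = data.iloc[2]

# Label 2 does not exist in the index, so loc raises KeyError
try:
    data.loc[2]
    found = True
except KeyError:
    found = False

print(third, found)
```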

### Data Selection in DataFrame

- A DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index

#### DataFrame as a Dictionary

In [73]:
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995})
pop = pd.Series({'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Texas | 695662 | 26448193 |
| New York | 141297 | 19651127 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |


- The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

In [74]:
data['pop']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: pop, dtype: int64

- Equivalently, we can use attribute-style access with column names that are strings:

In [75]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

- This attribute-style column access actually accesses the exact same object as the dictionary-style access:

In [76]:
data.area is data['area']

True

- Though this is a useful shorthand, keep in mind that it does not work for all cases
- For example, if the column names are not strings, or if the column names conflict with methods of the DataFrame, this attribute-style access is not possible
- For example, the DataFrame has a pop() method, so data.pop will point to this rather than the "pop" column

In [77]:
data.pop is data['pop']

False

- In particular, you should avoid the temptation to try column assignment via attribute (i.e., use data['pop'] = z rather than data.pop = z)
- This dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [78]:
data['density'] = data['pop'] / data['area']
data

| | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.413926 |
| Texas | 695662 | 26448193 | 38.01874 |
| New York | 141297 | 19651127 | 139.076746 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |


- This shows a preview of the straightforward syntax of element-by-element arithmetic between Series objects
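
The attribute-access pitfall and the dictionary-style column assignment can be sketched together on a two-row version of the same data:

```python
import pandas as pd

data = pd.DataFrame({
    "area": [423967, 695662],
    "pop": [38332521, 26448193],
}, index=["California", "Texas"])

# 'pop' collides with the DataFrame.pop() method, so attribute access
# returns the bound method rather than the column
print(callable(data.pop))

# Dictionary-style access always reaches the column, and assigning to a
# new key adds a column
data["density"] = data["pop"] / data["area"]
print(data)
```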

#### DataFrame as Two-Dimensional Array

- We can also view the DataFrame as an enhanced two-dimensional array
- We can examine the raw underlying data array using the values attribute:

In [79]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

- With this picture in mind, many familiar array-like operations can be performed on the DataFrame itself
- For example, we can transpose the full DataFrame to swap rows and columns:

In [80]:
data.T

| | California | Texas | New York | Florida | Illinois |
|---|---|---|---|---|---|
| area | 423967.0 | 695662.0 | 141297.0 | 170312.0 | 149995.0 |
| pop | 38332520.0 | 26448190.0 | 19651130.0 | 19552860.0 | 12882140.0 |
| density | 90.41393 | 38.01874 | 139.0767 | 114.8061 | 85.88376 |


- When it comes to indexing DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array
- In particular, passing a single index to an array accesses a row:

In [81]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

- And passing a single "index" to a DataFrame accesses a column:

In [82]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

- Thus for array-style indexing, we need another convention
- Here Pandas again uses the loc and iloc indexers (the older ix indexer is no longer available as of pandas 1.0)
- Using the iloc indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:

In [83]:
data.iloc[:3, :2]

| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Texas | 695662 | 26448193 |
| New York | 141297 | 19651127 |


- Similarly, using the loc indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [84]:
data.loc[:'Illinois', :'pop']

| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Texas | 695662 | 26448193 |
| New York | 141297 | 19651127 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |


- The ix indexer allowed a hybrid of these two approaches, but it was removed in pandas 1.0
- For integer indices, ix was subject to the same potential sources of confusion as discussed for integer-indexed Series objects
- Any of the familiar NumPy-style data access patterns can be used within these indexers
- For example, in the loc indexer we can continue masking and fancy indexing as in the following:

In [86]:
data.loc[data.density > 100, ['pop', 'density']]

| | pop | density |
|---|---|---|
| New York | 19651127 | 139.076746 |
| Florida | 19552860 | 114.806121 |
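
For comparison, the same selection can be written with iloc. This is a sketch under one assumption about usage: the boolean mask is converted to a plain NumPy array before being passed to the positional indexer.

```python
import pandas as pd

data = pd.DataFrame({
    "area": [423967, 695662, 141297, 170312, 149995],
    "pop": [38332521, 26448193, 19651127, 19552860, 12882135],
}, index=["California", "Texas", "New York", "Florida", "Illinois"])
data["density"] = data["pop"] / data["area"]

# loc: boolean mask on rows, list of labels on columns
dense = data.loc[data["density"] > 100, ["pop", "density"]]

# iloc: the same rows and columns, selected purely by position
mask = (data["density"] > 100).to_numpy()
dense_pos = data.iloc[mask, [1, 2]]

print(dense)
```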


- Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy

In [87]:
data.iloc[0, 2] = 90
data

| | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.0 |
| Texas | 695662 | 26448193 | 38.01874 |
| New York | 141297 | 19651127 | 139.076746 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |


#### Additional Indexing Conventions

- While indexing refers to columns, slicing refers to rows:

In [88]:
data['Florida':'Illinois']

| | area | pop | density |
|---|---|---|---|
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |


- Such slices can also refer to rows by number rather than by index:

In [89]:
data[1:3]

| | area | pop | density |
|---|---|---|---|
| Texas | 695662 | 26448193 | 38.01874 |
| New York | 141297 | 19651127 | 139.076746 |


- Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [90]:
data[data.density > 100]

| | area | pop | density |
|---|---|---|---|
| New York | 141297 | 19651127 | 139.076746 |
| Florida | 170312 | 19552860 | 114.806121 |
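
Each of these row-wise shortcuts has a more explicit spelling through loc or iloc. The sketch below rebuilds the same table and writes the three selections that way, which may be easier to read when it is unclear whether rows or columns are meant.

```python
import pandas as pd

data = pd.DataFrame({
    "area": [423967, 695662, 141297, 170312, 149995],
    "pop": [38332521, 26448193, 19651127, 19552860, 12882135],
}, index=["California", "Texas", "New York", "Florida", "Illinois"])
data["density"] = data["pop"] / data["area"]

rows_by_label = data.loc["Florida":"Illinois"]   # label slice of rows
rows_by_position = data.iloc[1:3]                # positional slice of rows
rows_by_mask = data.loc[data["density"] > 100]   # boolean mask over rows

print(rows_by_label, rows_by_position, rows_by_mask, sep="\n\n")
```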
