# Introduction to NumPy and Pandas

In [9]:
import pandas as pd
import numpy as np

## NumPy Arrays

NumPy arrays differ from normal Python lists in that every element of an array has the same data type, which allows for fast element access and efficient data manipulation. They are called ndarrays since they can have any number (n) of dimensions (d). A one-dimensional array is often called a vector, and a two-dimensional array is often called a matrix.



The code below initializes a Python list named `list1`:

In [10]:
list1 = [1,2,3,4]

To convert this to a one-dimensional ndarray with four elements, we can use the `np.array()` function:

In [11]:
array1 = np.array(list1)
print(array1)

[1 2 3 4]


To get a two-dimensional ndarray from a list, we must start with a Python list of lists:

In [13]:
list2 = [[1, 2, 3], [4, 5, 6]]
array2 = np.array(list2)
print(array2)

# The printed NumPy array clearly shows its two-dimensional
# structure: two rows and three columns.

[[1 2 3]
 [4 5 6]]
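
As a minimal sketch of how the dimensions and data type of an array can be inspected, the example below uses the `ndim`, `shape`, and `dtype` attributes. The `mixed` array is a hypothetical addition used only to illustrate that NumPy picks one common data type for all elements.

```python
import numpy as np

array1 = np.array([1, 2, 3, 4])
array2 = np.array([[1, 2, 3], [4, 5, 6]])

# ndim is the number of dimensions; shape is the size along each dimension
print(array1.ndim, array1.shape)  # 1 (4,)
print(array2.ndim, array2.shape)  # 2 (2, 3)

# Hypothetical example: mixing integers and a float yields one common dtype
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
```

Note that the shape of `array1` is `(4,)`: a one-dimensional array has a length but no separate notion of rows and columns.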


Many operations can be performed on NumPy arrays, which makes them very helpful for manipulating data:

* Selecting array elements

* Slicing arrays

* Reshaping arrays

* Splitting arrays

* Combining arrays

* Numerical operations (min, max, mean, etc.)
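
The operations listed above can be sketched briefly as follows. This is only one illustrative way to perform each of them; the variable names are made up for this example, and NumPy offers several alternatives for most of these tasks.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Selecting: the element in row 1, column 2 (indices start at 0)
element = arr[1, 2]
print(element)  # 6

# Slicing: the entire first row
first_row = arr[0, :]
print(first_row)  # [1 2 3]

# Reshaping: the same six values arranged as three rows and two columns
reshaped = arr.reshape(3, 2)

# Splitting: two equal parts along the first axis (one row each)
top, bottom = np.split(arr, 2)

# Combining: stack the array on top of itself, giving four rows
combined = np.concatenate([arr, arr], axis=0)
print(combined.shape)  # (4, 3)

# Numerical operations
print(arr.min(), arr.max(), arr.mean())  # 1 6 3.5
```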



## Pandas Series and DataFrames

### Series

A pandas Series is very similar to a one-dimensional NumPy array, but it has additional functionality that allows values in the Series to be indexed using labels. A NumPy array does not have the flexibility to do this. This labeling is useful when you are storing pieces of data that have other data associated with them. Say you want to store the ages of students in an online course to eventually figure out the average student age. If the ages were stored in a NumPy array, you could only access them by integer position: `0, 1, 2, ...`. With a Series object, the indices of values are set to `0, 1, 2, ...` **by default**, but you can customize the indices to be other values, such as student names, so that an age can be accessed using a name. Customized indices of a Series are established by sending values into the Series constructor, as you will see below.

A Series holds items of a single data type and can be created by passing a scalar value, Python list, dictionary, or ndarray to the pandas Series constructor. **If a dictionary is passed in, its keys are used as the indices.**
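
As a small illustration of the dictionary and scalar cases, the sketch below builds one Series from each. The names `ages_by_name` and `default_age`, and the labels `'a'`, `'b'`, `'c'`, are hypothetical and chosen only for this example.

```python
import pandas as pd

# Dictionary keys become the index labels; the values become the data
ages_by_name = pd.Series({'Emma': 13, 'Swetha': 25, 'Serajh': 19})
print(ages_by_name.index.tolist())  # ['Emma', 'Swetha', 'Serajh']

# A scalar value is repeated once for each label in the supplied index
default_age = pd.Series(18, index=['a', 'b', 'c'])
print(default_age.tolist())  # [18, 18, 18]
```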

In [15]:
# Create a Series using a NumPy array of ages with the default numerical indices
ages = np.array([13, 25, 19])
series1 = pd.Series(ages)
print(series1)

0    13
1    25
2    19
dtype: int64


When printing a Series, **the data type of its elements is also printed**. To customize the indices of a Series object, *use the index argument of the Series constructor.*

In [16]:
# Create a Series using a NumPy array of ages but customize the indices to be the names that correspond to each age
ages = np.array([13, 25, 19])
series1 = pd.Series(ages, index=['Emma', 'Swetha', 'Serajh'])
print(series1)

Emma      13
Swetha    25
Serajh    19
dtype: int64
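
With names as indices, an age can be looked up by name, and the average age mentioned earlier can be computed directly. The following is a minimal sketch that rebuilds the same Series so it can run on its own:

```python
import numpy as np
import pandas as pd

ages = np.array([13, 25, 19])
series1 = pd.Series(ages, index=['Emma', 'Swetha', 'Serajh'])

# Access a value by its label
print(series1['Swetha'])  # 25

# Positional access is still available through .iloc
print(series1.iloc[0])  # 13

# Average student age
print(series1.mean())  # 19.0
```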


### DataFrames

A DataFrame is similar in form to a matrix, as it consists of rows and columns. Both rows and columns can be indexed with integers or string labels. One DataFrame can contain columns of many different data types, but within a column, everything has to be the same data type. A column of a DataFrame is essentially a Series. All columns must have the same number of elements (rows).

There are different ways to fill a DataFrame, such as with a CSV file, a SQL query, a Python list, or a dictionary. Here we create a DataFrame using a Python list of lists. Each nested list represents the data in one row of the DataFrame. We use the `columns` keyword argument to pass in the list of our custom column names.

In [25]:
dataf = pd.DataFrame(
    [['John Smith', '123 Main St', 34], ['Jane Doe', '456 Maple Ave', 28],
     ['Joe Schmo', '789 Broadway', 51]],
    columns=['name', 'address', 'age'])
print(dataf)

         name        address  age
0  John Smith    123 Main St   34
1    Jane Doe  456 Maple Ave   28
2   Joe Schmo   789 Broadway   51
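
For comparison, here is a sketch of building the same table from a dictionary, where each key is a column name and each value is the list of entries in that column. It also shows that selecting a single column gives back a Series. The variable names `dataf_from_dict` and `age_column` are made up for this example.

```python
import pandas as pd

# Each dictionary key is a column name; each list holds that column's values
dataf_from_dict = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St', '456 Maple Ave', '789 Broadway'],
    'age': [34, 28, 51],
})
print(dataf_from_dict.shape)  # (3, 3)

# Selecting one column returns a Series
age_column = dataf_from_dict['age']
print(type(age_column).__name__)  # Series
print(age_column.max())  # 51
```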


The default row indices are `0, 1, 2, ...`, but these can be changed. For example, they can be set to be the elements in one of the columns of the DataFrame. To use the `name` column as indices instead of the default numerical values, we can call `set_index()` on our DataFrame, which returns a new DataFrame and leaves the original unchanged:

In [22]:
dataf.set_index('name')

                  address  age
name
John Smith    123 Main St   34
Jane Doe    456 Maple Ave   28
Joe Schmo    789 Broadway   51
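
Because `set_index()` returns a new DataFrame by default, its result needs to be assigned to a variable to be kept. The sketch below, which uses the hypothetical name `indexed` for that result, also shows how a row label can then be used with `.loc` to look up a value:

```python
import pandas as pd

dataf = pd.DataFrame(
    [['John Smith', '123 Main St', 34], ['Jane Doe', '456 Maple Ave', 28],
     ['Joe Schmo', '789 Broadway', 51]],
    columns=['name', 'address', 'age'])

# Keep the re-indexed result; dataf itself still has the default indices
indexed = dataf.set_index('name')
print(dataf.index.tolist())  # [0, 1, 2]
print(indexed.index.tolist())  # ['John Smith', 'Jane Doe', 'Joe Schmo']

# Look up a single value by row label and column name
print(indexed.loc['Jane Doe', 'age'])  # 28
```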
