## NumPy


Create an array

In [None]:
import numpy as np

np.array()   # pass a list (or nested lists) to build an array

Examples:

In [None]:
a = np.array([1, 2, 3])

In [None]:
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

### Indexing

In [None]:
a[0]

In [None]:
matrix[1, 1]

To get the first 2 elements of the last 2 rows:

In [None]:
matrix[1:, :2]

Arrays can also be indexed with a boolean sequence used to indicate which values should be included in the resulting array.



In [None]:
should_include_elements = [True, False, True]
a[should_include_elements]   # array([1, 3]): keeps only the positions marked True

### Vectorized Operations

In [None]:
my_array = np.array([-3, 0, 3, 16])

my_array - 5    # [-8 -5 -2 11]
my_array * 4    # [-12   0  12  64]
my_array / 2    # [-1.5  0.   1.5  8. ]
my_array ** 2   # [  9   0   9 256]
my_array % 2    # [1 0 1 0]

### Vectorized Comparison Operators

In [None]:
my_array = np.array([-3, 0, 3, 16])

my_array == -3   # [ True False False False]
my_array >= 0    # [False  True  True  True]
my_array < 10    # [ True  True  True False]

### Variable Substitution (Boolean Masks)

In [None]:
my_array[my_array % 2 == 0]

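my_array[...] wraps the boolean mask: the inner comparison is evaluated first and produces an array of True/False values, and the outer index then returns only the original values at the positions where the mask is True.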
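The two steps can be separated to make the mask visible. This is a minimal sketch that reuses the my_array values from the sections above; the names mask and evens are only illustrative.

```python
import numpy as np

my_array = np.array([-3, 0, 3, 16])

# Step 1: the comparison produces a boolean array of the same shape.
mask = my_array % 2 == 0
print(mask)     # [False  True False  True]

# Step 2: indexing with the mask keeps only the positions marked True.
evens = my_array[mask]
print(evens)    # [ 0 16]
```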

### random.randn(), zeros(), ones()

In [None]:
np.random.rand(10)              # 10 floats drawn uniformly from [0, 1)
np.random.randn(10)             # 10 floats from the standard normal distribution (mean 0, std 1)
np.random.randint(50, 100, 5)   # 5 integers with low=50 (inclusive) and high=100 (exclusive)

# another randint example using keyword arguments (students is any existing sequence)
np.random.randint(low=60, high=100, size=len(students))

np.random.randn(3, 4)           # 3 x 4 matrix of standard normal values

In [None]:
np.zeros(3)      # [0. 0. 0.]   array of zeros with the specified length
np.ones(3)       # [1. 1. 1.]   array of ones with the specified length
np.full(3, 17)   # [17 17 17]   array of the specified size filled with a given value

### arange()

In [None]:
np.arange(1, 4, 2)   # [1 3]   arange() can also handle decimal numbers

# np.arange([start,] stop[, step])   (stop is not inclusive)

### linspace()

In [None]:
np.linspace(1, 4, 4)   # [1. 2. 3. 4.]   a set number of evenly spaced elements between the min and max

# np.linspace(min, max, length)   (the max is inclusive)
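A side-by-side comparison may help show the difference between the two: arange() is driven by a step size and excludes the stop value, while linspace() is driven by the number of elements and includes the max. The variable names below are only illustrative.

```python
import numpy as np

# arange: start at 1, step by 0.5, stop before 4 (4 itself is excluded)
stepped = np.arange(1, 4, 0.5)
print(stepped.tolist())   # [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]

# linspace: 4 evenly spaced values from 1 to 4 (4 itself is included)
spaced = np.linspace(1, 4, 4)
print(spaced.tolist())    # [1.0, 2.0, 3.0, 4.0]
```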

### Array Methods and Properties

In [None]:
a.min()
a.max()
a.mean()
a.sum()
a.std()
idn = np.eye(4)         # identity matrix: ones on the diagonal, zeros everywhere else
a.reshape(rows, cols)   # rows * cols must equal the number of elements in the array
array.T                 # transpose array
array.transpose()       # transpose array

### Functions

In [None]:
np.median(array)
np.arange()
np.linspace()
np.argmin(array)          # returns the index of the first min value
np.argmax(array)          # returns the index of the first max value
series.idxmin()           # pandas Series method, not NumPy: returns the row label of the first min value
series.idxmax()           # pandas Series method, not NumPy: returns the row label of the first max value
np.log(array)             # returns an array with the natural log of each element
np.exp(array)             # returns an array with e raised to the power of each element
np.sqrt(array)
np.sin(array)
x_array.dot(y_array)      # for 1-D arrays: the dot product, same as sum(x_array * y_array)
np.dot(x_array, y_array)  # function form of the same operation; for 2-D arrays it returns the matrix product
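A small worked example of the two behaviours of dot may be useful here. The arrays below are made up for illustration.

```python
import numpy as np

x_array = np.array([1, 2, 3])
y_array = np.array([4, 5, 6])

# 1-D inputs: multiply element by element, then add up the results.
# 1*4 + 2*5 + 3*6 = 32
dot_1d = x_array.dot(y_array)
same_thing = (x_array * y_array).sum()

# 2-D inputs: np.dot performs matrix multiplication.
# Multiplying by the identity matrix returns the original values.
m = np.array([[1, 2], [3, 4]])
product = np.dot(m, np.eye(2))

# argmin / argmax return positions, not values.
position_of_min = np.argmin(y_array)   # 0, because 4 is the smallest value
position_of_max = np.argmax(y_array)   # 2, because 6 is the largest value
```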


## Pandas Series

Series can be created from a list or a numpy array:

In [None]:
series = pd.Series([100, 43, 26, 17])

Name your series if you like with:

In [None]:
series.name = 'My Numbers'

In [None]:
series.index
series.dtype
series.astype("float")   # convert the dtype; also astype("str"), astype("int")

### Series Methods


In [None]:
series.any()                # e.g. (series < 0).any() is True if at least one value is negative
series.all()                # e.g. (series < 0).all() is True only if every value is negative
series.head()
series.tail()
series.value_counts()       # returns a count of each unique value in a series
series.isin(set_of_values)  # returns whether or not each value in a series is in a set of known values
series.unique()             # returns unique values as NumPy array
series.str.contains("blah") # for string series: True where the value contains "blah"

### Series Functions

In [None]:
series.count()
series.sum()
series.mean()
series.median()
series.min()
series.max()
series.mode()
series.abs()
series.std()
series.quantile(q)   # q is a fraction between 0 and 1, e.g. series.quantile(0.25)
series.cumsum()
series.cummax()
series.cummin()
series.apply(func)               # function name or lambda, e.g. series.apply(lambda n: 'even' if n % 2 == 0 else 'odd')
string_series.str.lower()        # vectorized string methods
string_series.str.capitalize()

### Subsetting and Indexing


Like numpy arrays, we can use a series of boolean values to subset a series.



In [None]:
series[series > 40]
letters_series[letters_series.isin(vowels)]
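The isin example above relies on two names that are not defined in this section. A minimal sketch, assuming letters_series holds single characters and vowels is a list of vowel letters (both made up here), could look like this:

```python
import pandas as pd

# Hypothetical data: one letter per row.
letters_series = pd.Series(list("banana bread"))
vowels = list("aeiou")

# isin() returns a boolean series with one True/False per row.
mask = letters_series.isin(vowels)

# Indexing with the mask keeps the matching rows and their original index labels.
vowel_letters = letters_series[mask]
print(vowel_letters.tolist())         # ['a', 'a', 'a', 'e', 'a']
print(vowel_letters.index.tolist())   # [1, 3, 5, 9, 10]
```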

### Transforming Numerical to Categorical Values


We can use the cut function from pandas to put numerical values into discrete bins.

We can either specify the number of bins to create, and pandas will create bins of equal width, or we can specify the bin edges ourselves:



In [None]:
s = pd.Series(list(range(15)))      # a series of ints 0 to 14
pd.cut(s, 3)                        # put the series into three equal-width bins
pd.cut(s, [-1, 3, 12, 16])          # put the series into the specified bins
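To see what the explicit bins produce, the sketch below names each bin with the labels argument of pd.cut and counts how many values land in each. The label names are made up for illustration. By default each bin excludes its left edge and includes its right edge, so the bins are (-1, 3], (3, 12] and (12, 16].

```python
import pandas as pd

s = pd.Series(range(15))   # ints 0 to 14

# Illustrative labels, one per bin.
binned = pd.cut(s, bins=[-1, 3, 12, 16], labels=["low", "mid", "high"])

counts = binned.value_counts()
print(counts["low"])    # 4  (values 0 to 3)
print(counts["mid"])    # 9  (values 4 to 12)
print(counts["high"])   # 2  (values 13 and 14)
```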

## Dataframes


### Dataframe Methods and Attributes

In [None]:
df.info()
df.describe()
df.dtypes         # returns the type of each column
df.shape          # returns the number of rows and columns
df.index          # returns the labels for each row (usually autogenerated numbers)

df.columns        # returns the column names;
                  # .columns can also be assigned to in order to change the column names
df.columns = [col.upper() for col in df.columns]

df.head()
df.tail()
df.sample()

### Accessing Multiple Columns


In [None]:
df[['name', 'math']]

columns = ['name', 'math']
df[columns]


### Accessing Individual Columns


In [None]:
df.math
df['math']

### Subsetting and Indexing


In [None]:
df.math < 80             # returns a boolean series (a boolean mask)
df[df.math < 80]         # returns every row of df where the condition is True
                         # (the boolean mask is wrapped in the original df)

### Dropping and Renaming Columns


In [None]:
# Both of these methods return new dataframes and leave the original unchanged
# (unless you pass the optional kwarg inplace=True).

# .drop
df.drop(columns=['english', 'reading'])

# .rename
df.rename(columns={'name': 'student'})   # {original_name: new_name}

Because these methods each return a dataframe, we can chain them together:



In [None]:
df.drop(columns=['english', 'reading']).rename(columns={'name': 'student'})

### Creating New Columns


In [None]:
# This creates a new column in our original dataframe,
# based on the contents of another column.

df['passing_math'] = df.math > 70

# df.assign() returns a new dataframe and does not modify the existing df.

df.assign(passing_english=df.english >= 70)
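The difference between the two approaches is easiest to see on a small dataframe. The student data below is made up for illustration.

```python
import pandas as pd

# Hypothetical grades data.
df = pd.DataFrame({
    "name": ["Ada", "Bo", "Cy"],
    "math": [92, 65, 78],
    "english": [70, 88, 59],
})

# assign() returns a new dataframe; df itself is left untouched.
with_flag = df.assign(passing_english=df.english >= 70)
print("passing_english" in df.columns)          # False
print("passing_english" in with_flag.columns)   # True

# Bracket assignment adds the column to df directly.
df["passing_math"] = df.math > 70
print(df.passing_math.tolist())                 # [True, False, True]
```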

### Sorting Dataframes


In [None]:
# .sort_values()

df.sort_values(by='english')
df.sort_values(by='english', ascending=False)

### Chaining Dataframe Methods


Because most dataframe methods return another dataframe, it is common to see them chained together.



In [None]:
df[df.english > 90].sort_values(by='english').head(1).name

### Creating Dataframes


From Lists and Dictionaries


In [None]:
pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

pd.DataFrame([[1, 2, 3], [4, 5, 6]])

data = np.array([[1, 2, 3], [4, 5, 6]])
pd.DataFrame(data, columns=['a', 'b', 'c'])
# Notice here that we had to specify the names of the columns ourselves.

From Text Files


In [None]:
pd.read_csv
pd.read_json
pd.read_table
pd.read_sql    # creates a dataframe from the results of a SQL query

# database connection urls will have this format
# (protocol://[user[:password]@]hostname/[database_name])
# example:  mysql+pymysql://codeup:p@assw0rd@123.123.123.123/some_db

Another thing we need to consider is that we don't want to publish our database credentials to GitHub. However, we still need access to these values in our code in order to create the connection string defined above.

In order to accomplish this, we can define several variables in a file named env.py that contain the sensitive data, add env.py to our .gitignore file, and then import those values into another script.

In [None]:
from env import host, user, password

url = f'mysql+pymysql://{user}:{password}@{host}/employees'
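If several databases are used, the f-string can be wrapped in a small helper. The function name get_db_url and its parameters are hypothetical, not something defined by pandas or by env.py. The sketch also escapes the password with urllib.parse.quote_plus, on the assumption that a password may contain characters such as '@' or '/' that would otherwise be read as URL delimiters.

```python
from urllib.parse import quote_plus


def get_db_url(database, user, password, host):
    """Build a mysql+pymysql connection URL (hypothetical helper)."""
    # Escape special characters so they are not parsed as part of the URL structure.
    safe_password = quote_plus(password)
    return f"mysql+pymysql://{user}:{safe_password}@{host}/{database}"


# Placeholder credentials; in practice these would come from env.py.
url = get_db_url("employees", user="some_user", password="p@ss/word", host="123.123.123.123")
print(url)   # mysql+pymysql://some_user:p%40ss%2Fword@123.123.123.123/employees
```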

Once this url is defined, we can use it with the read_sql function to have pandas treat the results of a SQL query as a dataframe.

In [None]:
pd.read_sql('SELECT * FROM employees LIMIT 5 OFFSET 50', url)

It is common to have longer SQL queries that we want to read into python, and an example of how we might break a query into several lines is below:



In [None]:
sql = '''
SELECT
    emp_no,
    first_name,
    last_name
FROM employees
WHERE gender = 'F'
LIMIT 5
OFFSET 200
'''

pd.read_sql(sql, url)

In [None]:
query = '''
SELECT
    t.title as title,
    d.dept_name as dept_name
FROM titles t
JOIN dept_emp USING (emp_no)
JOIN departments d USING (dept_no)
'''

employees = pd.read_sql(query, url)

### Aggregation


In [None]:
# The .agg function lets us specify a way to aggregate a series of numerical values.

df.reading.agg('min')

df[['english', 'reading', 'math']].agg(['mean', 'min', 'max'])
# This returns a dataframe with the above columns and calculated values

### Grouping


The .groupby method is used to create a grouped object, which we can then apply an aggregation to. For example, if we wanted to know the highest math grade from each classroom:



In [None]:
df.groupby('classroom').math.max()

df.groupby('classroom').math.agg(['min', 'mean', 'max'])
# Multiple aggregations

(df
 .assign(passing_math=df.math.apply(lambda n: 'failing' if n < 70 else 'passing'))
 .groupby(['passing_math', 'classroom']) # note we now pass a list of columns
 .reading
 .agg(['mean', 'count']))
# Create a column named passing_math holding 'passing' or 'failing'
# Group by the new feature and the classroom
# Calculate the average reading grade and number of students in each subgroup
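The grouping calls above assume a grades dataframe that is not shown. A minimal sketch with made-up data (two classrooms, two students each) shows what the single and multiple aggregations return:

```python
import pandas as pd

# Hypothetical grades data.
df = pd.DataFrame({
    "classroom": ["A", "A", "B", "B"],
    "math": [62, 88, 90, 75],
    "reading": [80, 90, 70, 60],
})

# One aggregation: a series indexed by classroom.
max_math = df.groupby("classroom").math.max()
print(max_math["A"], max_math["B"])   # 88 90

# Several aggregations: a dataframe with one column per aggregation.
summary = df.groupby("classroom").math.agg(["min", "mean", "max"])
print(summary.loc["A", "mean"])       # 75.0
print(summary.loc["B", "mean"])       # 82.5
```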


### .transform()

The transform method can be used to produce a series with the same length as the original dataframe, where each value is the aggregation for the subgroup that row belongs to. For example, if we wanted to know the average math grade for each classroom, and add this data back to our original dataframe:

In [None]:
df.assign(avg_math_score_by_classroom=df.groupby('classroom').math.transform('mean'))

df.groupby('classroom').reading.describe()
# group by classroom and get stats breakdown on reading grades
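The key difference from a plain aggregation is the shape of the result: agg returns one row per group, while transform returns one value per original row. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical grades data.
df = pd.DataFrame({
    "classroom": ["A", "A", "B", "B"],
    "math": [62, 88, 90, 75],
})

# One value per group (2 rows).
per_group = df.groupby("classroom").math.mean()

# One value per original row (4 rows), repeating each group's mean.
per_row = df.groupby("classroom").math.transform("mean")
print(per_row.tolist())   # [75.0, 75.0, 82.5, 82.5]

# Because the lengths match, the result can be added back as a column.
with_avg = df.assign(avg_math_score_by_classroom=per_row)
```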

### Merging and Joining


In [None]:
# pd.concat combines dataframes vertically, one after the other

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'a': [4, 5, 6]})

pd.concat([df1, df2])
# indices are preserved from the original dataframes;
# the reset_index method can be used to make these sequential
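The index behaviour mentioned above can be checked directly. The sketch below shows the duplicated labels and two ways to get a sequential index: reset_index(drop=True) after the fact, or ignore_index=True in the concat call itself.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'a': [4, 5, 6]})

# Original row labels are kept, so 0, 1, 2 appear twice.
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())      # [0, 1, 2, 0, 1, 2]

# Option 1: renumber afterwards, discarding the old labels.
renumbered = stacked.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4, 5]

# Option 2: ask concat to build a fresh index.
fresh = pd.concat([df1, df2], ignore_index=True)
print(fresh.index.tolist())        # [0, 1, 2, 3, 4, 5]
```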

In [None]:
# pd.merge combines dataframes horizontally, like a SQL JOIN

pd.merge(users, roles, left_on='role_id', right_on='id', how='left')

pd.merge(
    users.rename(columns={'id': 'user_id', 'name': 'username'}),
    roles.rename(columns={'name': 'role_name'}),
    left_on='role_id', right_on='id', how='left')
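The merge calls above assume users and roles dataframes that are not defined in this section. The sketch below uses made-up data in which one user has a role_id with no match in roles, so the left join keeps that user and fills the role columns with NaN.

```python
import pandas as pd

# Hypothetical tables: user 3 points at a role that does not exist.
users = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["bob", "joe", "sally"],
    "role_id": [1, 2, 3],
})
roles = pd.DataFrame({
    "id": [1, 2],
    "name": ["admin", "author"],
})

# Renaming first avoids clashing 'id' and 'name' columns in the result.
merged = pd.merge(
    users.rename(columns={"id": "user_id", "name": "username"}),
    roles.rename(columns={"name": "role_name"}),
    left_on="role_id", right_on="id", how="left")

print(list(merged.columns))              # ['user_id', 'username', 'role_id', 'id', 'role_name']
print(merged.role_name.isna().tolist())  # [False, False, True]
```

With how='inner' instead of how='left', the unmatched user would be dropped from the result.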

### Reshaping
