# Get into shape! 🐍
## Understand and change Python data science object shapes

Shape errors are the bane of many folks learning machine learning. I would bet money that folks have quit their data science learning journey due to frustration with getting data into the shape required for machine learning algorithms. 

Having a strong understanding of how to reshape your data will spare you tears, save you time, and help you grow as a data scientist. 🎉

## Doing it
First, let's make sure we're using the same package versions.

Let's import the libraries we'll need under their usual aliases.  

In [2]:
import sys
import numpy as np
import scipy 
import pandas as pd
import tensorflow as tf
import torch
import sklearn
from sklearn.preprocessing import OneHotEncoder

If you don't have the libraries you need installed, uncomment the following cell and run it. Then run the imports cell again.

In [3]:
# !pip install -U numpy scipy pandas tensorflow torch scikit-learn

Let's check our package versions.

In [55]:
print(f"Python: {sys.version}")
print(f'NumPy: {np.__version__}')
print(f'pandas: {pd.__version__}')
print(f'scikit-learn: {sklearn.__version__}')

Python: 3.8.5 (default, Sep  4 2020, 02:22:02) 
[Clang 10.0.0 ]
NumPy: 1.19.2
pandas: 1.2.0
scikit-learn: 0.24.0


# pandas 

## Dimensions

A pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) has two dimensions: the rows and the columns. A pandas Series has one dimension: the rows.

Let's make a very small DataFrame with some hurricane data.

In [5]:
df_hurricanes = pd.DataFrame(dict(
    name=['Zeta', 'Andrew', 'Agnes'], 
    year=[2020, 1992, 1972 ]
))
df_hurricanes

|   | name | year |
|---|------|------|
| 0 | Zeta | 2020 |
| 1 | Andrew | 1992 |
| 2 | Agnes | 1972 |


You can see the number of dimensions a pandas data structure has with the [`ndim`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ndim.html) attribute. 

In [6]:
df_hurricanes.ndim

2

It has both rows and columns, so it has two dimensions.

## Shape

The shape attribute shows the number of items in each dimension. Checking a DataFrame's [`shape`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) returns a tuple with two integers. The first is the number of rows and the second is the number of columns. 👍

In [7]:
df_hurricanes.shape

(3, 2)

We have three rows and two columns. Cool. 😎

The [`size`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.size.html#pandas.DataFrame.size) attribute shows us how many cells we have. 

In [48]:
df_hurricanes.size

6

3 x 2 = 6

It's easy to get the number of dimensions and the size from the `shape` attribute, so that's the one we'll use. 🚀
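As a quick check of that claim, here is a minimal sketch that derives both values from `shape`. It rebuilds the small hurricane DataFrame so it runs on its own, and it assumes Python 3.8 or later for `math.prod`.

```python
import math

import pandas as pd

df_hurricanes = pd.DataFrame(dict(
    name=['Zeta', 'Andrew', 'Agnes'],
    year=[2020, 1992, 1972],
))

shape = df_hurricanes.shape

# The number of dimensions is the length of the shape tuple.
ndim_from_shape = len(shape)

# The size is the product of the entries in the shape tuple.
size_from_shape = math.prod(shape)

print(ndim_from_shape == df_hurricanes.ndim)
print(size_from_shape == df_hurricanes.size)
```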

Let's make a pandas Series from our DataFrame.

Use _just the brackets_ syntax to select a column by passing the name of the column as a string. You get back a Series.

In [9]:
years_series = df_hurricanes['year']
years_series

0    2020
1    1992
2    1972
Name: year, dtype: int64

In [10]:
type(years_series)

pandas.core.series.Series

What does the shape of a pandas [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) look like? We can use the Series [shape](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.shape.html) attribute to find out.

In [11]:
years_series.shape

(3,)

We have a tuple with just one value, the number of rows. Remember that the index doesn't count as a column. ☝️

What happens if we use _just the brackets_ again, except this time we pass a list containing a single column name?

In [12]:
years_df = df_hurricanes[['year']]
years_df

|   | year |
|---|------|
| 0 | 2020 |
| 1 | 1992 |
| 2 | 1972 |


In [13]:
type(years_df)

pandas.core.frame.DataFrame

My variable name might have given away the answer. 😉 You always get back a DataFrame if you pass a list of column names. 

In [14]:
years_df.shape

(3, 1)

__Take away__: the shape of a pandas Series and the shape of a pandas DataFrame with one column are different! A DataFrame has a shape of _rows_ by _columns_ and a Series has a shape of _rows_. This is a key point that trips folks up.
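If you need to move between the two, pandas has a method for each direction. The sketch below is one way to do it, not the only one; it rebuilds the `year` column as a standalone Series so it runs by itself.

```python
import pandas as pd

years_series = pd.Series([2020, 1992, 1972], name='year')

# Series -> single-column DataFrame: shape goes from (3,) to (3, 1).
years_df = years_series.to_frame()
print(years_df.shape)

# Single-column DataFrame -> Series: shape goes from (3, 1) back to (3,).
back_to_series = years_df.squeeze('columns')
print(back_to_series.shape)
```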

---
# NumPy

Pandas and NumPy are bedrock libraries of data science in Python. Pandas is built on top of NumPy. NumPy's ndarray is its core data structure. An ndarray can have as many dimensions as your memory will allow.

There are many ways to create NumPy arrays, depending upon your goals. Check out my guide on the topic here. TK Let's make a NumPy array from our DataFrame and check its shape. 

In [15]:
two_d_arr = df_hurricanes.to_numpy()
two_d_arr

array([['Zeta', 2020],
       ['Andrew', 1992],
       ['Agnes', 1972]], dtype=object)

In [16]:
type(two_d_arr)

numpy.ndarray

In [17]:
two_d_arr.shape

(3, 2)

The shape returned matches what we saw when we took the shape in pandas. Pandas and NumPy share some attributes and methods, including the _shape_ attribute.

Let's convert the pandas Series we made earlier into a NumPy array and check its shape.

In [49]:
one_d_arr = years_series.to_numpy()
one_d_arr

array([2020, 1992, 1972])

In [50]:
type(one_d_arr)

numpy.ndarray

In [51]:
one_d_arr.shape

(3,)

Same result in pandas and NumPy. Cool!

## The problem

Things get tricky when an object expects data to arrive in a certain shape.

For example, most scikit-learn transformers and estimators expect to be fed their predictive X data in two-dimensional form. 
The target variable, y, is expected to be one-dimensional. Let's demonstrate with a silly example where we use _year_ to predict the hurricane name.

We'll make _x_ lowercase because it has just one dimension.

In [52]:
x = df_hurricanes['year']
x

0    2020
1    1992
2    1972
Name: year, dtype: int64

In [53]:
type(x)

pandas.core.series.Series

In [54]:
x.shape

(3,)

Same goes for our output variable, _y_.

In [25]:
y = df_hurricanes['name']
y

0      Zeta
1    Andrew
2     Agnes
Name: name, dtype: object

In [26]:
type(y)

pandas.core.series.Series

In [27]:
y.shape

(3,)

Let's instantiate and fit a LogisticRegression model.

In [28]:
from sklearn.linear_model import LogisticRegression

In [29]:
lr = LogisticRegression()

In [30]:
lr.fit(x, y)

ValueError: Expected 2D array, got 1D array instead:
array=[2020. 1992. 1972.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

And you get a value error. The last lines read:

```
ValueError: Expected 2D array, got 1D array instead:
array=[2020 1992 1972].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
```

In [31]:
x.reshape(-1, 1)

AttributeError: 'Series' object has no attribute 'reshape'

Reshaping works if you have a NumPy array, but _x_ is a pandas Series, which has no `reshape` method. So we get an error.

We could change our Series into a NumPy array and then reshape it to have two dimensions. However, as you saw above, there's an easier way to make _x_ a 2D object. Just pass the columns as a list using _just the brackets_ syntax.

Let's do it!

In [32]:
x = df_hurricanes[['year']]
x

|   | year |
|---|------|
| 0 | 2020 |
| 1 | 1992 |
| 2 | 1972 |


In [33]:
type(x)

pandas.core.frame.DataFrame

In [34]:
x.shape

(3, 1)

Now we can fit our model without errors! 😁 

In [35]:
lr.fit(x, y)

LogisticRegression()

## Reshaping NumPy arrays

If our data were stored in a 1D NumPy array, then we could do what the error message suggests and turn it into a 2D array with `reshape`. Let's try that with the data we saved as a 1D NumPy array earlier.

In [36]:
one_d_arr

array([2020, 1992, 1972])

In [37]:
one_d_arr.shape

(3,)

Let's reshape it.

In [38]:
two_d_arr_from_reshape = one_d_arr.reshape(-1, 1)
two_d_arr_from_reshape

array([[2020],
       [1992],
       [1972]])

In [39]:
two_d_arr_from_reshape.shape

(3, 1)

Now instead of a 1D array we have a 2D array! 🎺

What does using `-1` with `.reshape()` do? Why do we use it? We could have gotten an array with shape _(3, 1)_ by passing `reshape(3, 1)`.

In [60]:
hard_coded_arr_shape = one_d_arr.reshape(3, 1)
hard_coded_arr_shape

array([[2020],
       [1992],
       [1972]])

In [61]:
hard_coded_arr_shape.shape

(3, 1)

Ok, we can hard code the shape. But generally it's better to do this dynamically. The _-1_ makes it dynamic.

Passing a positive integer means _give that dimension that shape_. Above we passed a _1_ so the second dimension - the columns - got a 1. 

Passing _-1_ for the other dimension means that NumPy computes that dimension's length from whatever data is left, so that the new array holds all the original data.

So here you end up with a 2D array with 3 rows and 1 column.

It's a good practice to make our code flexible so that it can handle however many observations we throw at it. So instead of hard-coding both dimensions, use `-1`. 🙂
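To see that flexibility in action, here is a small sketch using made-up arrays of different lengths (built with `np.arange` purely for illustration). It also shows that NumPy can infer only one dimension at a time.

```python
import numpy as np

# The same reshape call works for any number of observations.
for n_observations in (3, 5, 10):
    arr = np.arange(n_observations)
    column = arr.reshape(-1, 1)
    print(arr.shape, '->', column.shape)

# NumPy can infer only one dimension, so two -1 values raise an error.
two_unknowns_failed = False
try:
    np.arange(6).reshape(-1, -1)
except ValueError as err:
    two_unknowns_failed = True
    print(f'ValueError: {err}')
```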

The same principle can be followed for reshaping with higher dimensional arrays.

In [67]:
two_d_arr

array([['Zeta', 2020],
       ['Andrew', 1992],
       ['Agnes', 1972]], dtype=object)

In [68]:
two_d_arr.shape

(3, 2)

In [75]:
three_d_arr = two_d_arr.reshape(2, 1, 3)
three_d_arr

array([[['Zeta', 2020, 'Andrew']],

       [[1992, 'Agnes', 1972]]], dtype=object)
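Notice that names and years now sit side by side in the same row. By default `reshape` reads the elements in row-major order and pours them into the new shape in that same order; it doesn't know which values belong together. A minimal sketch with a stand-in array of integers makes the ordering easier to see:

```python
import numpy as np

grid = np.arange(6).reshape(3, 2)
# [[0 1]
#  [2 3]
#  [4 5]]

cube = grid.reshape(2, 1, 3)
# [[[0 1 2]]
#  [[3 4 5]]]

# The flattened order of the elements is unchanged by the reshape.
same_order = np.array_equal(grid.ravel(), cube.ravel())
print(cube.shape)
print(same_order)
```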

You can always use `-1` to indicate which dimension should be computed to give all the data a home.

In [78]:
arr = two_d_arr.reshape(1, 2, -1, 1)
arr

array([[[['Zeta'],
         [2020],
         ['Andrew']],

        [[1992],
         ['Agnes'],
         [1972]]]], dtype=object)

Note that if the reshape dimensions don't make sense, you'll get an error.

In [79]:
two_d_arr.reshape(4, 2)
two_d_arr

ValueError: cannot reshape array of size 6 into shape (4,2)
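The rule behind that error is that the new shape must hold exactly as many items as the array has. The sketch below checks this with a hypothetical helper, `fits`, which is not part of NumPy and does not handle `-1`. It also shows that `reshape` returns a new array object and leaves the original's shape alone.

```python
import math

import numpy as np

arr = np.arange(6).reshape(3, 2)


def fits(array, new_shape):
    """Hypothetical helper: True if new_shape holds exactly array.size items."""
    return math.prod(new_shape) == array.size


print(fits(arr, (4, 2)))  # 8 slots for 6 items
print(fits(arr, (2, 3)))  # 6 slots for 6 items

# reshape returns a new array object; the original keeps its shape.
reshaped = arr.reshape(2, 3)
print(arr.shape, reshaped.shape)
```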

### Predicting

If you have one single sample that you are predicting with sklearn, then make the rows _1_ and the columns match how many features you have. In other words, make the columns _-1_. If you don't do this, your prediction will fail.

In [64]:
lr.predict(np.array([2012]))

ValueError: Expected 2D array, got 1D array instead:
array=[2012].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

We can follow the helpful error suggestion and `reshape(1, -1)`. 

In [63]:
lr.predict(np.array([2012]).reshape(1, -1))

array(['Zeta'], dtype=object)
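With a single feature the two suggestions in the error message give the same shape, so the difference is easier to see with more features. Here is a sketch with a hypothetical observation that has three made-up feature values:

```python
import numpy as np

# Hypothetical single observation with three features.
sample = np.array([2012, 3, 150])

# One sample, many features: a single row.
as_single_sample = sample.reshape(1, -1)

# Many samples, one feature: a single column.
as_single_feature = sample.reshape(-1, 1)

print(as_single_sample.shape)
print(as_single_feature.shape)
```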

While we're on the topic of reshaping for scikit-learn, note that the text vectorization transformers behave differently than other scikit-learn transformers. They assume you have just one column of text, so they expect a 1D array. ⚠️ 
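As an illustration, here is a minimal sketch with `CountVectorizer`, one of scikit-learn's text vectorizers. The three short documents are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A 1D collection of documents (a plain list here; a pandas Series also works).
docs = ['hurricane zeta', 'hurricane andrew', 'tropical storm agnes']

vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(docs)

# The 1D input becomes a 2D sparse matrix: documents by vocabulary terms.
print(X_text.shape)
```

The input is 1D, but the output is 2D, so it can be passed straight to an estimator.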

---
## Ways to make a 1D array

In addition to reshaping with `reshape`, NumPy's `flatten` and `ravel` both return a 1D array.
The differences are in whether they create a copy or a view of the original array and whether the data is stored contiguously in memory. Check out [this](https://stackoverflow.com/a/28930580/4590385) nice Stack Overflow answer for more info.
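The copy versus view difference can be checked directly. This sketch uses a small contiguous array, which is the case where `ravel` can return a view:

```python
import numpy as np

arr = np.arange(6).reshape(3, 2)

flat_copy = arr.flatten()  # always a copy
flat_view = arr.ravel()    # a view when possible, as it is here

print(np.shares_memory(arr, flat_copy))
print(np.shares_memory(arr, flat_view))

# Writing through the view changes the original; the copy is independent.
flat_view[0] = 99
print(arr[0, 0], flat_copy[0])
```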

Let's look at one other way to squeeze a 2D array into a 1D array.

## Squeeze out unneeded dimensions

When you have a multi-dimensional array but one of the dimensions doesn't hold any new information you can _squeeze_ out the unnecessary dimension with `.squeeze()`. For example, let's use the array we made earlier.

In [40]:
two_d_arr_from_reshape

array([[2020],
       [1992],
       [1972]])

In [42]:
two_d_arr_from_reshape.shape

(3, 1)

In [43]:
squeezed = np.squeeze(two_d_arr_from_reshape)

In [44]:
squeezed.shape

(3,)

Ta da!
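One thing to watch for: without an `axis` argument, `squeeze` drops every dimension of length 1, which may be more than you wanted. A short sketch, using small arrays typed out by hand:

```python
import numpy as np

column = np.array([[2020], [1992], [1972]])  # shape (3, 1)

# Name the axis to squeeze so nothing else is removed by accident.
squeezed = np.squeeze(column, axis=1)
print(squeezed.shape)

# Without an axis, every length-1 dimension is dropped.
single = np.array([[2012]])  # shape (1, 1)
print(np.squeeze(single).shape)
print(np.squeeze(single, axis=0).shape)

# Squeezing an axis whose length is not 1 raises an error.
bad_axis_failed = False
try:
    np.squeeze(column, axis=0)
except ValueError as err:
    bad_axis_failed = True
    print(f'ValueError: {err}')
```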

That's the end!


## Wrap

You've seen how to reshape NumPy arrays. Hopefully a lot more of the mathematical, scientific, and machine learning code you see will now make sense, and you'll be able to quickly manipulate NumPy arrays into the shapes you need.

The TensorFlow and PyTorch libraries play nicely with NumPy and can handle higher dimensional arrays representing things like video data. Getting the data into the shape the input layer to your neural network requires is a frequent source of errors. You can use the tools above to reshape your data into the required dimensions.


I help people learn how to data things with Python and other tools.

Happy reshaping! 🔵🔷


## Shapes for deep learning

NumPy arrays, TensorFlow tensors, and PyTorch tensors can handle higher dimensional arrays. 

A vanilla TensorFlow feed forward model expects a pandas DataFrame or NumPy array to have a shape that matches what is specified in the first layer.

For example, an image dataset may store each image as one row, with height, width, and color dimensions for its pixels.

You can reshape the dimensions. 

A 1D convolutional layer expects the data to be in the form 

Just watch out, because a 2D convolutional layer expects each sample to have two spatial dimensions instead of one.

Video data adds another dimension, as it consists of many frames in order.
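Here is a rough NumPy sketch of how those extra dimensions stack up. The sizes are made up, and the channels-last layout (height, width, color channels) is an assumption; some frameworks put the channels first instead.

```python
import numpy as np

# Hypothetical sizes, using a channels-last layout as an assumption.
height, width, channels, frames = 4, 5, 3, 2

image = np.zeros((height, width, channels))

# Add a leading batch dimension: one image becomes a batch of one.
batch = np.expand_dims(image, axis=0)

# A video clip stacks frames along a new first axis.
video = np.stack([image] * frames)

print(image.shape)
print(batch.shape)
print(video.shape)
```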

In his Coursera lectures on deep learning, Andrew Ng has a tip for avoiding shape errors: make everything 2D, I think it was.