# NumPy and Python Lists #

NumPy is primarily about working with arrays of numbers. A NumPy array is a different object from a Python list, although a Python list can be converted into a NumPy array with `np.array`.

In [5]:
import numpy as np

In [6]:
mylist = [1,2,3,4,5]

In [7]:
mylist[0]

1

In [8]:
mylist[4]

5

In [9]:
mynp = np.array(mylist)

In [10]:
mylist

[1, 2, 3, 4, 5]

In [11]:
mynp

array([1, 2, 3, 4, 5])

In [12]:
mynp.dtype

dtype('int64')

In [13]:
mylist.dtype

AttributeError: 'list' object has no attribute 'dtype'

In [14]:
type(mylist)

list

In [15]:
mynp[0]

1

In [16]:
mynp[4]

5

In [17]:
my2d = np.array([[1,2,3],[4,5,6]])

In [18]:
my2d[0,0]

1

In [19]:
my2d[1,0]

4

In [20]:
my2d[1,2]

6

In [21]:
my2d.shape

(2, 3)
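One practical difference between a list and an array is how arithmetic behaves. The following is a minimal sketch that reuses the names `mylist` and `mynp` from the cells above; the names `doubled_list`, `doubled_array`, and `back_to_list` are introduced here only for illustration.

```python
import numpy as np

mylist = [1, 2, 3, 4, 5]
mynp = np.array(mylist)

# Multiplying a list by 2 repeats the list;
# multiplying an array by 2 doubles every element.
doubled_list = mylist * 2
doubled_array = mynp * 2

print(doubled_list)   # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
print(doubled_array)  # [ 2  4  6  8 10]

# .tolist() converts an array back into a plain Python list.
back_to_list = mynp.tolist()
```

This elementwise behavior is the main reason numeric work is usually done on arrays rather than lists.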

# Pandas dataframes are essentially dictionaries #

A dictionary is another Python data structure: it maps keys to values.

In [22]:
mydictionary = { 'a': [1,2,3], 'b': [4,5,6]}

In [23]:
mydictionary['a']

[1, 2, 3]

In [24]:
mydictionary['b']

[4, 5, 6]

See [this page](https://www.w3schools.com/python/python_ref_dictionary.asp) for some methods for dealing with dictionaries.
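A few of the most commonly used dictionary methods can be sketched with the same `mydictionary` as above. The variable names `keys`, `values`, `missing`, and `lengths` are illustrative only.

```python
mydictionary = {'a': [1, 2, 3], 'b': [4, 5, 6]}

keys = list(mydictionary.keys())      # the keys, in insertion order
values = list(mydictionary.values())  # the values, in the same order

# .get() returns a default instead of raising KeyError for a missing key.
missing = mydictionary.get('c', [])

# .items() yields (key, value) pairs, which is handy for looping.
lengths = {key: len(value) for key, value in mydictionary.items()}
```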

In [25]:
import pandas as pd

In [26]:
mydf = pd.DataFrame(data=mydictionary)

In [27]:
mydf

|   | a | b |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
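The analogy between a DataFrame and a dictionary can be made concrete: the dictionary keys become column names, and a column is looked up with the same square-bracket syntax. This is a small sketch; `column_names`, `column_a`, and `round_trip` are names chosen for illustration.

```python
import pandas as pd

mydictionary = {'a': [1, 2, 3], 'b': [4, 5, 6]}
mydf = pd.DataFrame(data=mydictionary)

# The dictionary keys become the column names.
column_names = list(mydf.columns)

# Indexing by column name works like indexing a dictionary by key,
# except that the result is a pandas Series rather than a list.
column_a = mydf['a']

# to_dict(orient='list') converts the DataFrame back into a dictionary of lists.
round_trip = mydf.to_dict(orient='list')
```

The analogy is loose: a DataFrame also carries a row index and requires every column to have the same length.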


# Overview of Modeling with SKLearn #

See [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) for the documentation on `LinearRegression`. See [this page](https://scikit-learn.org/stable/) for an overview of scikit-learn, and [this page](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares) for an overview of LinearRegression (ordinary least squares).

There is essentially a four-step process for solving *any* problem in SKLearn:

+ Import the model
+ Make an instance of the model
+ Train the model
+ Predict with the model
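The four steps can be sketched end to end on a tiny made-up dataset. The numbers below are an assumption chosen so that the points lie exactly on the line y = 2x + 1; they are not from the course data.

```python
import numpy as np

# Step 1: import the model
from sklearn.linear_model import LinearRegression

# Made-up training data lying exactly on y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # shape (n_samples, n_features)
y = np.array([3.0, 5.0, 7.0, 9.0])          # shape (n_samples,)

# Step 2: make an instance of the model
model = LinearRegression()

# Step 3: train the model
model.fit(X, y)

# Step 4: predict with the model (the input is 2D here as well)
pred = model.predict(np.array([[5.0]]))
print(pred)  # approximately [11.]
```

Other scikit-learn estimators follow the same import, instantiate, `fit`, `predict` pattern, which is why the steps are worth memorizing.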

In [28]:
from sklearn.linear_model import LinearRegression

In [None]:
help(LinearRegression)  # Better to use the webpage!

In [30]:
mylm = LinearRegression()

In [31]:
help(mylm.fit)

Help on method fit in module sklearn.linear_model._base:

fit(X, y, sample_weight=None) method of sklearn.linear_model._base.LinearRegression instance
    Fit linear model.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Training data.
    
    y : array-like of shape (n_samples,) or (n_samples, n_targets)
        Target values. Will be cast to X's dtype if necessary.
    
    sample_weight : array-like of shape (n_samples,), default=None
        Individual weights for each sample.
    
        .. versionadded:: 0.17
           parameter *sample_weight* support to LinearRegression.
    
    Returns
    -------
    self : object
        Fitted Estimator.



In [32]:
help(mylm.predict)

Help on method predict in module sklearn.linear_model._base:

predict(X) method of sklearn.linear_model._base.LinearRegression instance
    Predict using the linear model.
    
    Parameters
    ----------
    X : array-like or sparse matrix, shape (n_samples, n_features)
        Samples.
    
    Returns
    -------
    C : array, shape (n_samples,)
        Returns predicted values.



# Subtleties with Pandas, Numpy, and SKLearn #

The fundamental issue is that the .fit() and .predict() methods both require the data to have a very specific **shape**. See [this page](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) and note that your data may not arrive in that shape. In particular, from the above:

+ X (input variables) should be of shape (n_samples, n_features)
+ y (output variable) should be of shape (n_samples,)
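A quick way to confirm these two requirements before fitting is to inspect `.ndim` and `.shape`. The sketch below uses the first four wingspan and height values from the course data; `shapes_look_right` is a hypothetical helper written for this example, not part of NumPy or scikit-learn.

```python
import numpy as np

wingspan = np.array([67.1, 68.0, 71.0, 69.1])
height = np.array([66.0, 64.5, 73.0, 67.0])

# reshape(-1, 1) asks NumPy to infer the number of rows,
# giving one column (one feature) and n_samples rows.
X = wingspan.reshape(-1, 1)
y = height

def shapes_look_right(X, y):
    """Hypothetical check: X is 2D, y is 1D, and both have the same n_samples."""
    return X.ndim == 2 and y.ndim == 1 and X.shape[0] == y.shape[0]

print(shapes_look_right(X, y))              # True
print(shapes_look_right(wingspan, height))  # False: wingspan is 1D
```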

### **Example:** Predict height from wingspan with LinearRegression ###

In [37]:
import pandas as pd

In [38]:
mydata = pd.read_csv("https://raw.githubusercontent.com/aleahy-work/STAT223-S24/main/Data/firstdaydata.csv")

In [39]:
mydata.head(4)

|   | randnum | diehard | earlobes | hitchthumb | age | siblings | pizzatop | knoxyear | eyecolor | cellphone | hand | gender | height | wingspan | cubit | handspan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | Yes | Detached | Hitchhikers thumb | 19 | 1 | Pepperoni | Sophomore | Brown | Android | Right | Female | 66.0 | 67.1 | 17.5 | 7.5 |
| 1 | 7 | Yes | Attached | Hitchhikers thumb | 21 | 0 | Cheese | Senior and beyond | Brown | Android | Left | Female | 64.5 | 68.0 | 18.0 | 6.5 |
| 2 | 9 | No | Detached | No hitchhikers thumb | 19 | 2 | Cheese | Sophomore | Brown | Android | Right | Male | 73.0 | 71.0 | 18.5 | 7.5 |
| 3 | 3 | No | Detached | No hitchhikers thumb | 19 | 0 | Cheese | Sophomore | Other | Android | Right | Male | 67.0 | 69.1 | 18.4 | 7.9 |


In [40]:
import numpy as np

In [42]:
mydata['wingspan'].shape  # Note that this is the wrong shape for an input!

(39,)

In [44]:
mydata['wingspan'].reshape(39,1).shape

AttributeError: 'Series' object has no attribute 'reshape'

Reshaping is something that NumPy does very well, but a pandas Series does not have a `.reshape` method. So we take the `.values` from the Series, which is just a NumPy array, and use NumPy's `.reshape` to put the data into the right shape.
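The conversion can be sketched on a small stand-in DataFrame built from the first four rows shown in `mydata.head(4)`, so the example runs without downloading the CSV. The double-bracket alternative `myx_alt` is an addition for comparison, not something the notebook itself uses.

```python
import pandas as pd

# Stand-in for mydata: only two columns and the first four rows.
mydata = pd.DataFrame({
    'wingspan': [67.1, 68.0, 71.0, 69.1],
    'height': [66.0, 64.5, 73.0, 67.0],
})

# Series -> NumPy array -> 2D column.
# (.to_numpy() is the spelling the pandas docs now recommend over .values.)
myx = mydata['wingspan'].values.reshape(-1, 1)

# Alternative: double brackets select a one-column DataFrame,
# which is already two-dimensional.
myx_alt = mydata[['wingspan']]

print(mydata['wingspan'].shape)  # (4,)
print(myx.shape)                 # (4, 1)
print(myx_alt.shape)             # (4, 1)
```

Using `-1` instead of a hard-coded row count such as 39 means the same line keeps working if rows are added or removed.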

In [43]:
mydata['wingspan'].values

array([67.1, 68. , 71. , 69.1, 61.2, 67.5, 71. , 70.1, 61.5, 56. , 65. ,
       65. , 74. , 62.3, 71.5, 72.5, 68. , 70. , 65. , 62.8, 74. , 72. ,
       68. , 62. , 55.3, 70.5, 71.7, 77.8, 67.9, 63. , 63. , 78. , 64.5,
       62. , 59.6, 72. , 63.4, 69.7, 62.5])

In [57]:
mydata['wingspan'].values.reshape(39,1)

array([[67.1],
       [68. ],
       [71. ],
       [69.1],
       [61.2],
       [67.5],
       [71. ],
       [70.1],
       [61.5],
       [56. ],
       [65. ],
       [65. ],
       [74. ],
       [62.3],
       [71.5],
       [72.5],
       [68. ],
       [70. ],
       [65. ],
       [62.8],
       [74. ],
       [72. ],
       [68. ],
       [62. ],
       [55.3],
       [70.5],
       [71.7],
       [77.8],
       [67.9],
       [63. ],
       [63. ],
       [78. ],
       [64.5],
       [62. ],
       [59.6],
       [72. ],
       [63.4],
       [69.7],
       [62.5]])

In [44]:
myx = mydata['wingspan'].values.reshape(39,1)  # Now the data is in the right shape, so we give it a name

In [45]:
myx.shape

(39, 1)

In [46]:
mydata['height'].values

array([66. , 64.5, 73. , 67. , 64.8, 70.1, 71. , 71. , 63.5, 58. , 67. ,
       65.7, 73. , 62.3, 73. , 71. , 68. , 71.8, 64.5, 66.5, 72. , 74. ,
       67. , 64. , 66. , 70.1, 70.4, 75.2, 69.1, 65. , 64. , 74.5, 64.8,
       64.5, 68. , 70. , 63.8, 69.4, 64.5])

In [47]:
mydata['height'].values.shape  # This is the right shape for an output, so we are fine

(39,)

In [48]:
myy = mydata['height'].values

In [50]:
mymodel = mylm.fit(myx, myy)

In [52]:
mymodel.coef_

array([0.63477313])

In [53]:
mymodel.intercept_

25.32694551620032
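Together, the slope and intercept describe the fitted line: predicted height = intercept + slope × wingspan. As a worked check, the arithmetic for a wingspan of 70 can be done by hand with the printed values (the slope is shown rounded to eight decimals, so the result is approximate).

```python
slope = 0.63477313             # mymodel.coef_[0] as printed above
intercept = 25.32694551620032  # mymodel.intercept_ as printed above

wingspan = 70
predicted_height = intercept + slope * wingspan

print(round(predicted_height, 3))  # 69.761
```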

In [54]:
mymodel.score(myx, myy)

0.7505320966768354
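For a regression model, `.score()` returns the coefficient of determination, R². The sketch below recomputes it by hand as 1 − SS_res / SS_tot and compares it with `.score()`. It uses only the first four rows of the course data as a stand-in, so the number it produces will not match the 0.75 above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in data: first four (wingspan, height) pairs from the course data.
X = np.array([[67.1], [68.0], [71.0], [69.1]])
y = np.array([66.0, 64.5, 73.0, 67.0])

model = LinearRegression().fit(X, y)
predictions = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - predictions) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(np.isclose(r_squared, model.score(X, y)))  # True
```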

In [72]:
mymodel.predict(70)

ValueError: Expected 2D array, got scalar array instead:
array=70.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

# Why is this a problem? #

From the .predict() documentation, X (the input) is expected to have shape (n_samples, n_features), and the scalar `70` doesn't have the correct shape.

In [56]:
np.array([70]).shape  # Even [70] is the wrong shape

(1,)

In [57]:
np.array([[70]]).shape  # But [[70]] has the right shape

(1, 1)

In [58]:
mymodel.predict([[70]])

array([69.76106471])
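The same shape rule makes it easy to predict for several inputs at once: put them in a single column with `reshape(-1, 1)`. This is a sketch fitted on only the first four rows of the course data, so its predictions will differ from the value above; the inputs 65, 70, and 75 are arbitrary examples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in data: first four (wingspan, height) pairs from the course data.
X = np.array([[67.1], [68.0], [71.0], [69.1]])
y = np.array([66.0, 64.5, 73.0, 67.0])
model = LinearRegression().fit(X, y)

# Three samples, one feature each: shape (3, 1).
new_wingspans = np.array([65, 70, 75]).reshape(-1, 1)
predictions = model.predict(new_wingspans)  # shape (3,)

# A single sample can also be written with reshape(1, -1): shape (1, 1).
single = model.predict(np.array([70]).reshape(1, -1))

print(new_wingspans.shape, predictions.shape, single.shape)  # (3, 1) (3,) (1,)
```

Each prediction is just `intercept_ + coef_[0] * wingspan`, so the output always has one entry per input row.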