# Video: Linear Regression with Different Data Structures

This code example briefly repeats the linear regression example using native Python and NumPy data structures to show off scikit-learn's flexibility.

## Linear Regression with Different Data Structures

* **Goal:** Demonstrate how scikit-learn works with a variety of input data structures covered in this module.
  * **Before:** Used pandas data frames in the previous video.
  * **Now:** Repeat analysis with NumPy and native Python lists.

## Code Example: Linear Regression with Different Data Structures

Script:
* We will start with the same imports and data loading as in the previous video, so I will repeat those quickly now.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.linear_model

In [None]:
abalone_pandas = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx602-examples/main/data/abalone.tsv", sep="\t")

In [None]:
abalone_target_pandas = abalone_pandas["Rings"]
abalone_features_pandas = abalone_pandas.drop(["Rings", "Sex"], axis=1)

Script:
* To test each kind of data structure, I will write a function to take in the feature and target data, build a linear model, and test it.


In [None]:
def test_data_structure(features, target):
    linear_model = sklearn.linear_model.LinearRegression()
    linear_model.fit(features, target)
    print("SCORE", linear_model.score(features, target))

    predictions = linear_model.predict(features)
    print("PREDICTION TYPE", type(predictions))

    return predictions

Script:
* To spot check how the model is working, this function prints the R squared score and the type of the prediction output, then returns the actual predictions, which the notebook will display.
* Let's test it with the pandas data frames first.
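
As a side note on the score that the function prints: for `LinearRegression`, `score` returns the coefficient of determination, R squared. The sketch below recomputes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The five data points are made up for illustration and are not from the abalone data.

```python
import numpy as np
import sklearn.linear_model

# Made-up toy data for illustration only (not from the abalone file).
features = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
target = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

model = sklearn.linear_model.LinearRegression().fit(features, target)
predictions = model.predict(features)

# R squared = 1 - (residual sum of squares) / (total sum of squares)
residual_ss = np.sum((target - predictions) ** 2)
total_ss = np.sum((target - target.mean()) ** 2)
manual_r2 = 1.0 - residual_ss / total_ss

# The two numbers should agree up to floating-point rounding.
print("MANUAL", manual_r2)
print("SCORE", model.score(features, target))
```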

In [None]:
test_data_structure(abalone_features_pandas, abalone_target_pandas)

SCORE 0.5276299399919837
PREDICTION TYPE <class 'numpy.ndarray'>


array([ 8.77893882,  7.23758471, 10.84589582, ..., 10.88079477,
        9.6174032 , 11.04119896])

Script:
* This matches the R^2 score when we previously ran the regression, and a NumPy array was returned.
* Let's try passing NumPy arrays now.

In [None]:
abalone_features_numpy = abalone_features_pandas.to_numpy()
abalone_features_numpy

array([[0.455 , 0.365 , 0.095 , ..., 0.2245, 0.101 , 0.15  ],
       [0.35  , 0.265 , 0.09  , ..., 0.0995, 0.0485, 0.07  ],
       [0.53  , 0.42  , 0.135 , ..., 0.2565, 0.1415, 0.21  ],
       ...,
       [0.6   , 0.475 , 0.205 , ..., 0.5255, 0.2875, 0.308 ],
       [0.625 , 0.485 , 0.15  , ..., 0.531 , 0.261 , 0.296 ],
       [0.71  , 0.555 , 0.195 , ..., 0.9455, 0.3765, 0.495 ]])

Script:
* The pandas data frame to_numpy method converts the data frame values into a two-dimensional NumPy array.

In [None]:
abalone_target_numpy = abalone_target_pandas.to_numpy()
abalone_target_numpy

array([15,  7,  9, ...,  9, 10, 12])

Script:
* And the target pandas series is converted into a one-dimensional NumPy array.
* Let's test building the scikit-learn model with those NumPy arrays.
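
As a quick check of the two conversions just described, here is a minimal sketch on a three-row frame. The numbers are the first rows shown in the outputs above, but the column names `Length` and `Diameter` are assumptions made for this illustration.

```python
import pandas as pd

# Tiny stand-in frame; the "Length" and "Diameter" column names are assumed.
frame = pd.DataFrame(
    {
        "Length": [0.455, 0.35, 0.53],
        "Diameter": [0.365, 0.265, 0.42],
        "Rings": [15, 7, 9],
    }
)

# A data frame converts to a two-dimensional array (rows x columns).
features = frame.drop(["Rings"], axis=1).to_numpy()
# A series converts to a one-dimensional array.
target = frame["Rings"].to_numpy()

print("FEATURES", type(features).__name__, features.ndim, features.shape)
print("TARGET", type(target).__name__, target.ndim, target.shape)
```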

In [None]:
test_data_structure(abalone_features_numpy, abalone_target_numpy)

SCORE 0.5276299399919837
PREDICTION TYPE <class 'numpy.ndarray'>


array([ 8.77893882,  7.23758471, 10.84589582, ..., 10.88079477,
        9.6174032 , 11.04119896])

Script:
* That looks just like the output when we ran the model with pandas data structures.
* I'll repeat this quickly with Python lists.
* The tolist method of NumPy arrays will make this quick.
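
One reason to prefer tolist over the built-in `list` here: tolist converts every level of the array into native Python objects, while `list` only unpacks the outermost dimension. A small sketch with made-up values:

```python
import numpy as np

array_2d = np.array([[0.455, 0.365], [0.35, 0.265]])

# tolist recursively converts to nested Python lists of Python floats.
nested = array_2d.tolist()
# list() only splits the first axis, so each row is still a NumPy array.
shallow = list(array_2d)

print(type(nested[0]).__name__, type(nested[0][0]).__name__)
print(type(shallow[0]).__name__)
```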

In [None]:
abalone_features_python = abalone_features_numpy.tolist()
abalone_features_python

[[0.455, 0.365, 0.095, 0.514, 0.2245, 0.101, 0.15],
 [0.35, 0.265, 0.09, 0.2255, 0.0995, 0.0485, 0.07],
 [0.53, 0.42, 0.135, 0.677, 0.2565, 0.1415, 0.21],
 [0.44, 0.365, 0.125, 0.516, 0.2155, 0.114, 0.155],
 [0.33, 0.255, 0.08, 0.205, 0.0895, 0.0395, 0.055],
 [0.425, 0.3, 0.095, 0.3515, 0.141, 0.0775, 0.12],
 [0.53, 0.415, 0.15, 0.7775, 0.237, 0.1415, 0.33],
 [0.545, 0.425, 0.125, 0.768, 0.294, 0.1495, 0.26],
 [0.475, 0.37, 0.125, 0.5095, 0.2165, 0.1125, 0.165],
 [0.55, 0.44, 0.15, 0.8945, 0.3145, 0.151, 0.32],
 [0.525, 0.38, 0.14, 0.6065, 0.194, 0.1475, 0.21],
 [0.43, 0.35, 0.11, 0.406, 0.1675, 0.081, 0.135],
 [0.49, 0.38, 0.135, 0.5415, 0.2175, 0.095, 0.19],
 [0.535, 0.405, 0.145, 0.6845, 0.2725, 0.171, 0.205],
 [0.47, 0.355, 0.1, 0.4755, 0.1675, 0.0805, 0.185],
 [0.5, 0.4, 0.13, 0.6645, 0.258, 0.133, 0.24],
 [0.355, 0.28, 0.085, 0.2905, 0.095, 0.0395, 0.115],
 [0.44, 0.34, 0.1, 0.451, 0.188, 0.087, 0.13],
 [0.365, 0.295, 0.08, 0.2555, 0.097, 0.043, 0.1],
 [0.45, 0.32, 0.1, 0.381, 0.

Script:
* That was a long preview.

In [None]:
abalone_target_python = abalone_target_numpy.tolist()
abalone_target_python

[15,
 7,
 9,
 10,
 7,
 8,
 20,
 16,
 9,
 19,
 14,
 10,
 11,
 10,
 10,
 12,
 7,
 10,
 7,
 9,
 11,
 10,
 12,
 9,
 10,
 11,
 11,
 12,
 15,
 11,
 10,
 15,
 18,
 19,
 13,
 8,
 16,
 8,
 11,
 9,
 9,
 14,
 5,
 5,
 4,
 7,
 9,
 7,
 6,
 9,
 8,
 7,
 10,
 10,
 7,
 8,
 8,
 8,
 4,
 7,
 7,
 9,
 10,
 7,
 8,
 8,
 12,
 13,
 10,
 6,
 13,
 8,
 20,
 11,
 13,
 15,
 9,
 10,
 11,
 14,
 9,
 12,
 16,
 21,
 14,
 12,
 13,
 10,
 9,
 12,
 15,
 12,
 13,
 10,
 15,
 14,
 9,
 8,
 7,
 10,
 7,
 15,
 15,
 10,
 12,
 12,
 11,
 10,
 9,
 9,
 9,
 9,
 9,
 9,
 11,
 11,
 11,
 10,
 9,
 8,
 9,
 7,
 14,
 6,
 6,
 5,
 6,
 8,
 19,
 18,
 17,
 9,
 7,
 7,
 7,
 8,
 7,
 9,
 9,
 9,
 10,
 10,
 16,
 11,
 10,
 10,
 10,
 9,
 5,
 4,
 15,
 9,
 10,
 10,
 12,
 10,
 13,
 16,
 13,
 13,
 13,
 13,
 12,
 18,
 16,
 14,
 20,
 20,
 14,
 12,
 14,
 7,
 8,
 8,
 5,
 7,
 5,
 8,
 4,
 11,
 14,
 21,
 10,
 10,
 12,
 13,
 12,
 10,
 11,
 9,
 13,
 12,
 14,
 8,
 10,
 12,
 11,
 16,
 15,
 10,
 9,
 13,
 12,
 13,
 8,
 9,
 9,
 8,
 13,
 7,
 10,
 7,
 12,
 9,
 14,
 10,
 8,
 7,
 

Script:
* Both lists of data are ready now.

In [None]:
test_data_structure(abalone_features_python, abalone_target_python)

SCORE 0.5276299399919839
PREDICTION TYPE <class 'numpy.ndarray'>


array([ 8.77893882,  7.23758471, 10.84589582, ..., 10.88079477,
        9.6174032 , 11.04119896])

Script:
* And we get back a NumPy array again.
* Scikit-learn sure likes NumPy arrays.
* I'll repeat the tests for the other data structures for comparison.

In [None]:
test_data_structure(abalone_features_numpy, abalone_target_numpy)

SCORE 0.5276299399919837
PREDICTION TYPE <class 'numpy.ndarray'>


array([ 8.77893882,  7.23758471, 10.84589582, ..., 10.88079477,
        9.6174032 , 11.04119896])

In [None]:
test_data_structure(abalone_features_pandas, abalone_target_pandas)

SCORE 0.5276299399919837
PREDICTION TYPE <class 'numpy.ndarray'>


array([ 8.77893882,  7.23758471, 10.84589582, ..., 10.88079477,
        9.6174032 , 11.04119896])

Script:
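* Those all match, apart from a difference in the last digit of the score for the Python lists, which is just floating-point rounding.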

## Scikit-learn Internals Use NumPy

* Anything that is array-like according to NumPy should work as an input.
  * Scikit-learn usually converts the input into an appropriate NumPy array early in the method call.
* Expect NumPy arrays out.
* Behavior may be different when other developers extend scikit-learn classes.


Script:
* To sum up, Scikit-learn generally uses NumPy arrays internally.
* Array-like inputs of other types are usually turned into NumPy arrays early on.
* And by default, NumPy arrays are returned as output.
* You might see different behavior if another developer writes new modeling code extending Scikit-learn, but usually they will stick to this pattern.
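
The pattern summarized above can be sketched on a small scale. In this example, `np.asarray` stands in for the kind of conversion scikit-learn performs; it is an approximation of the idea, not scikit-learn's actual validation code. The toy data and the column names `x0` and `x1` are made up for illustration.

```python
import numpy as np
import pandas as pd
import sklearn.linear_model

# Made-up toy data (not from the abalone file): target = 2*x0 + 3*x1 + 1
features_list = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
target_list = [9.0, 8.0, 19.0, 18.0]

# The same features in three different containers.
inputs = {
    "list": features_list,
    "ndarray": np.asarray(features_list),
    "DataFrame": pd.DataFrame(features_list, columns=["x0", "x1"]),
}

predictions_by_input = {}
for name, features in inputs.items():
    # np.asarray shows what each input looks like once it is a NumPy array.
    as_array = np.asarray(features)
    model = sklearn.linear_model.LinearRegression()
    model.fit(features, target_list)
    predictions_by_input[name] = model.predict(features)
    print(name, as_array.shape, type(predictions_by_input[name]).__name__)
```

Each input type should report the same array shape and a NumPy array as the prediction type, mirroring what the abalone tests showed.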