# Date fruit and Pumpkin seed dataset

# Techniques
In this code-along exercise, we will cover the following data processing techniques:
- Basic data views
- Renaming columns with `df.rename()`
- Column selection with `df["column"]` and `df[["col_a", "col_b"]]`
- Filtering, viewing subsets with `df.loc[]`
- Sorting with `df.sort_values()`
- Merges with `pd.concat`, `df.join`, or `df.merge`
- Removing outliers with custom functions
- Normalization 

[Pandas documentation](https://pandas.pydata.org/docs/reference/frame.html)

# Imports
By convention, we use `as` in the import statement to alias `numpy` to `np` and `pandas` to `pd`. We will also follow the convention of naming pandas DataFrame objects `df`.

In [1]:
import numpy as np
import pandas as pd

# Load the data
The features in the [Pumpkin seeds dataset](https://www.kaggle.com/datasets/muratkokludataset/pumpkin-seeds-dataset) and the [Date fruits dataset](https://www.kaggle.com/datasets/muratkokludataset/date-fruit-datasets) were derived from automated analysis of the respective foods.

The goals of working with these datasets are as follows:
- Join the two datasets to do classification between the two classes of foods
- Clean up the features to prepare them for use in a machine learning algorithm
- Learn about basic DataFrame manipulation

In [2]:
dfd = pd.read_csv("/Users/trevoryu/Code/data/bdpp_data/date_fruits.csv")
dfp = pd.read_csv("/Users/trevoryu/Code/data/bdpp_data/pumpkin_seeds.csv")

# DataFrame basics

To view the first `n` rows of a DataFrame, we can use `df.head(n)`. By default, `n = 5`. Similarly, to view the last `n` rows, we can use `df.tail(n)`.

In a Jupyter Notebook, the value of the last expression in a cell is displayed below the cell. Writing `df` as the last line of a cell therefore displays an abridged version of the DataFrame.

In [3]:
dfd.head()

Unnamed: 0,AREA,PERIMETER,MAJOR_AXIS,MINOR_AXIS,ECCENTRICITY,EQDIASQ,SOLIDITY,CONVEX_AREA,EXTENT,ASPECT_RATIO,...,KurtosisRR,KurtosisRG,KurtosisRB,EntropyRR,EntropyRG,EntropyRB,ALLdaub4RR,ALLdaub4RG,ALLdaub4RB,Class
0,422163,2378.908,837.8484,645.6693,0.6373,733.1539,0.9947,424428,0.7831,1.2976,...,3.237,2.9574,4.2287,-59191260000.0,-50714214400,-39922372608,58.7255,54.9554,47.84,BERHI
1,338136,2085.144,723.8198,595.2073,0.569,656.1464,0.9974,339014,0.7795,1.2161,...,2.6228,2.635,3.1704,-34233070000.0,-37462601728,-31477794816,50.0259,52.8168,47.8315,BERHI
2,526843,2647.394,940.7379,715.3638,0.6494,819.0222,0.9962,528876,0.7657,1.315,...,3.7516,3.8611,4.7192,-93948350000.0,-74738221056,-60311207936,65.4772,59.286,51.9378,BERHI
3,416063,2351.21,827.9804,645.2988,0.6266,727.8378,0.9948,418255,0.7759,1.2831,...,5.0401,8.6136,8.2618,-32074310000.0,-32060925952,-29575010304,43.39,44.1259,41.1882,BERHI
4,347562,2160.354,763.9877,582.8359,0.6465,665.2291,0.9908,350797,0.7569,1.3108,...,2.7016,2.9761,4.4146,-39980970000.0,-35980042240,-25593278464,52.7743,50.908,42.6666,BERHI


In [4]:
dfp.tail(10)

Unnamed: 0,Area,Perimeter,Major_Axis_Length,Minor_Axis_Length,Convex_Area,Equiv_Diameter,Eccentricity,Solidity,Extent,Roundness,Aspect_Ration,Compactness,Class
2490,51555,934.911,401.8321,164.7038,52013,256.2067,0.9121,0.9912,0.7187,0.7412,2.4397,0.6376,Ürgüp Sivrisi
2491,69836,1010.605,396.6286,224.7918,70419,298.1911,0.8239,0.9917,0.6693,0.8593,1.7644,0.7518,Ürgüp Sivrisi
2492,84236,1274.656,456.9323,237.154,85248,327.4944,0.8548,0.9881,0.6104,0.6515,1.9267,0.7167,Ürgüp Sivrisi
2493,58987,977.41,404.0779,186.371,59518,274.0522,0.8873,0.9911,0.7327,0.7759,2.1681,0.6782,Ürgüp Sivrisi
2494,79755,1146.431,470.3888,217.8296,80649,318.6647,0.8863,0.9889,0.7175,0.7626,2.1594,0.6774,Ürgüp Sivrisi
2495,79637,1224.71,533.1513,190.4367,80381,318.4289,0.934,0.9907,0.4888,0.6672,2.7996,0.5973,Ürgüp Sivrisi
2496,69647,1084.318,462.9416,191.821,70216,297.7874,0.9101,0.9919,0.6002,0.7444,2.4134,0.6433,Ürgüp Sivrisi
2497,87994,1210.314,507.22,222.1872,88702,334.7199,0.899,0.992,0.7643,0.7549,2.2828,0.6599,Ürgüp Sivrisi
2498,80011,1182.947,501.9065,204.7531,80902,319.1758,0.913,0.989,0.7374,0.7185,2.4513,0.6359,Ürgüp Sivrisi
2499,84934,1159.933,462.8951,234.5597,85781,328.8485,0.8621,0.9901,0.736,0.7933,1.9735,0.7104,Ürgüp Sivrisi


In [5]:
dfd

Unnamed: 0,AREA,PERIMETER,MAJOR_AXIS,MINOR_AXIS,ECCENTRICITY,EQDIASQ,SOLIDITY,CONVEX_AREA,EXTENT,ASPECT_RATIO,...,KurtosisRR,KurtosisRG,KurtosisRB,EntropyRR,EntropyRG,EntropyRB,ALLdaub4RR,ALLdaub4RG,ALLdaub4RB,Class
0,422163,2378.9080,837.8484,645.6693,0.6373,733.1539,0.9947,424428,0.7831,1.2976,...,3.2370,2.9574,4.2287,-5.919126e+10,-50714214400,-39922372608,58.7255,54.9554,47.8400,BERHI
1,338136,2085.1440,723.8198,595.2073,0.5690,656.1464,0.9974,339014,0.7795,1.2161,...,2.6228,2.6350,3.1704,-3.423307e+10,-37462601728,-31477794816,50.0259,52.8168,47.8315,BERHI
2,526843,2647.3940,940.7379,715.3638,0.6494,819.0222,0.9962,528876,0.7657,1.3150,...,3.7516,3.8611,4.7192,-9.394835e+10,-74738221056,-60311207936,65.4772,59.2860,51.9378,BERHI
3,416063,2351.2100,827.9804,645.2988,0.6266,727.8378,0.9948,418255,0.7759,1.2831,...,5.0401,8.6136,8.2618,-3.207431e+10,-32060925952,-29575010304,43.3900,44.1259,41.1882,BERHI
4,347562,2160.3540,763.9877,582.8359,0.6465,665.2291,0.9908,350797,0.7569,1.3108,...,2.7016,2.9761,4.4146,-3.998097e+10,-35980042240,-25593278464,52.7743,50.9080,42.6666,BERHI
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
893,255403,1925.3650,691.8453,477.1796,0.7241,570.2536,0.9785,261028,0.7269,1.4499,...,2.2423,2.3704,2.7202,-2.529642e+10,-19168882688,-18473392128,49.0869,43.0422,42.4153,SOGAY
894,365924,2664.8230,855.4633,551.5447,0.7644,682.5752,0.9466,386566,0.6695,1.5510,...,3.4109,3.5805,3.9910,-3.160522e+10,-21945366528,-19277905920,46.8086,39.1046,36.5502,SOGAY
895,254330,1926.7360,747.4943,435.6219,0.8126,569.0545,0.9925,256255,0.7240,1.7159,...,2.2759,2.5090,2.6951,-2.224277e+10,-19594921984,-17592152064,44.1325,40.7986,40.9769,SOGAY
896,238955,1906.2679,716.6485,441.8297,0.7873,551.5859,0.9604,248795,0.6954,1.6220,...,2.6769,2.6874,2.7991,-2.604860e+10,-21299822592,-19809978368,51.2267,45.7162,45.6260,SOGAY


Note that the Jupyter Notebook also prints the number of rows and columns when it auto-displays a DataFrame. We can also view the shape of the DataFrame using `df.shape`. By convention, the first number in the tuple is the number of rows and the second is the number of columns.

In [6]:
dfd.shape

(898, 35)

We can access the index and column axes of the DataFrame with `df.index` and `df.columns`, respectively. The values of the DataFrame can be accessed with `df.values`, which returns a NumPy array (the pandas documentation recommends the equivalent `df.to_numpy()` for new code).

In [7]:
dfp.index

RangeIndex(start=0, stop=2500, step=1)

In [8]:
dfp.columns

Index(['Area', 'Perimeter', 'Major_Axis_Length', 'Minor_Axis_Length',
       'Convex_Area', 'Equiv_Diameter', 'Eccentricity', 'Solidity', 'Extent',
       'Roundness', 'Aspect_Ration', 'Compactness', 'Class'],
      dtype='object')

In [9]:
dfp.values

array([[56276, 888.242, 326.1485, ..., 1.4809, 0.8207, 'Çerçevelik'],
       [76631, 1068.146, 417.1932, ..., 1.7811, 0.7487, 'Çerçevelik'],
       [71623, 1082.987, 435.8328, ..., 2.0651, 0.6929, 'Çerçevelik'],
       ...,
       [87994, 1210.314, 507.22, ..., 2.2828, 0.6599, 'Ürgüp Sivrisi'],
       [80011, 1182.947, 501.9065, ..., 2.4513, 0.6359, 'Ürgüp Sivrisi'],
       [84934, 1159.933, 462.8951, ..., 1.9735, 0.7104, 'Ürgüp Sivrisi']],
      dtype=object)

Other conventions:
- Specify one column with `"column name"`, specify multiple columns with `["list", "of", "column", "names"]`
- The index (`axis=0`) holds a label identifying each row; the columns (`axis=1`) hold a label identifying each column. Labels are usually unique, although pandas does not require it.
- Most operations default to `axis=0` (the index). It's best practice to specify the axis explicitly for clarity.
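To make the axis convention concrete, here is a minimal sketch on a toy DataFrame. The frame `toy` and its values are made up for illustration and are not part of either dataset.

```python
import pandas as pd

# Toy data; the labels and values are invented for this example.
toy = pd.DataFrame(
    {"area": [10, 20, 30], "perimeter": [1.0, 2.0, 3.0]},
    index=["a", "b", "c"],
)

# axis=0 collapses the rows, giving one result per column.
col_sums = toy.sum(axis=0)

# axis=1 collapses the columns, giving one result per row.
row_sums = toy.sum(axis=1)

# drop() looks labels up in the index when axis=0 and in the columns when axis=1.
without_row = toy.drop("a", axis=0)
without_col = toy.drop("area", axis=1)

print(col_sums)
print(row_sums)
```

Writing the axis out, even when it matches the default, makes it obvious at a glance whether a call acts on rows or on columns.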

# Renaming Columns
Analyzing the column names, we notice a few things about the dataset. 

Of benefit:
1. Both datasets use many of the same columns
2. Both datasets use underscores to delimit words

Of concern:
1. The two datasets use different capitalization
2. The features of "equivalent diameter" and "aspect ratio" are spelled differently
3. The major and minor axis features are missing the word "length" for the date fruits

In this section, we will deal with these concerns by renaming the columns using `df.rename()`.

`df.rename()` has two forms:
- `df.rename(mapper=function, axis="columns")` where we specify a function that is applied to every column name. The function should produce a unique output for each column name.
- `df.rename(mapper=dict, axis="columns")` where we specify a dictionary with the old column names as keys and the new column names as values. The keys do not have to exist in the DataFrame; any key that is not present has no effect and no error is raised.
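Both forms can be tried on a small DataFrame first. The sketch below uses a made-up two-column frame named `toy`; only the column names echo the real datasets.

```python
import pandas as pd

# Toy frame with deliberately messy column names (values are invented).
toy = pd.DataFrame({"AREA": [1, 2], "Aspect_Ration": [1.5, 2.5]})

# Form 1: a function applied to every column name.
lowered = toy.rename(mapper=str.lower, axis="columns")

# Form 2: a dictionary of old name -> new name.
# "eqdiasq" is not a column of this frame, so that entry is silently ignored.
fixed = lowered.rename(
    mapper={"aspect_ration": "aspect_ratio", "eqdiasq": "equiv_diameter"},
    axis="columns",
)

print(list(fixed.columns))
```

`rename` returns a new DataFrame and leaves `toy` unchanged, which is why the result has to be assigned to a name.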

To deal with the capitalization, we will apply a function that transforms the string into all lowercase. To deal with the different names, we will make a dictionary that maps the lower-case incorrect names to correct ones.

In [10]:
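# Lowercase every column name
mapper_function = lambda x: x.lower()

# Map the inconsistent lowercase names to a shared spelling.
# Keys that are missing from a DataFrame are ignored, so one dict serves both.
mapper_dict = {
    "aspect_ration": "aspect_ratio",
    "eqdiasq": "equiv_diameter",
    "major_axis": "major_axis_length",
    "minor_axis": "minor_axis_length"
}

# Lowercase first so that the dictionary keys match
dfd = dfd.rename(mapper=mapper_function, axis="columns")
dfd = dfd.rename(mapper=mapper_dict, axis="columns")

dfp = dfp.rename(mapper=mapper_function, axis="columns")
dfp = dfp.rename(mapper=mapper_dict, axis="columns")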

In [11]:
dfd

Unnamed: 0,area,perimeter,major_axis_length,minor_axis_length,eccentricity,equiv_diameter,solidity,convex_area,extent,aspect_ratio,...,kurtosisrr,kurtosisrg,kurtosisrb,entropyrr,entropyrg,entropyrb,alldaub4rr,alldaub4rg,alldaub4rb,class
0,422163,2378.9080,837.8484,645.6693,0.6373,733.1539,0.9947,424428,0.7831,1.2976,...,3.2370,2.9574,4.2287,-5.919126e+10,-50714214400,-39922372608,58.7255,54.9554,47.8400,BERHI
1,338136,2085.1440,723.8198,595.2073,0.5690,656.1464,0.9974,339014,0.7795,1.2161,...,2.6228,2.6350,3.1704,-3.423307e+10,-37462601728,-31477794816,50.0259,52.8168,47.8315,BERHI
2,526843,2647.3940,940.7379,715.3638,0.6494,819.0222,0.9962,528876,0.7657,1.3150,...,3.7516,3.8611,4.7192,-9.394835e+10,-74738221056,-60311207936,65.4772,59.2860,51.9378,BERHI
3,416063,2351.2100,827.9804,645.2988,0.6266,727.8378,0.9948,418255,0.7759,1.2831,...,5.0401,8.6136,8.2618,-3.207431e+10,-32060925952,-29575010304,43.3900,44.1259,41.1882,BERHI
4,347562,2160.3540,763.9877,582.8359,0.6465,665.2291,0.9908,350797,0.7569,1.3108,...,2.7016,2.9761,4.4146,-3.998097e+10,-35980042240,-25593278464,52.7743,50.9080,42.6666,BERHI
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
893,255403,1925.3650,691.8453,477.1796,0.7241,570.2536,0.9785,261028,0.7269,1.4499,...,2.2423,2.3704,2.7202,-2.529642e+10,-19168882688,-18473392128,49.0869,43.0422,42.4153,SOGAY
894,365924,2664.8230,855.4633,551.5447,0.7644,682.5752,0.9466,386566,0.6695,1.5510,...,3.4109,3.5805,3.9910,-3.160522e+10,-21945366528,-19277905920,46.8086,39.1046,36.5502,SOGAY
895,254330,1926.7360,747.4943,435.6219,0.8126,569.0545,0.9925,256255,0.7240,1.7159,...,2.2759,2.5090,2.6951,-2.224277e+10,-19594921984,-17592152064,44.1325,40.7986,40.9769,SOGAY
896,238955,1906.2679,716.6485,441.8297,0.7873,551.5859,0.9604,248795,0.6954,1.6220,...,2.6769,2.6874,2.7991,-2.604860e+10,-21299822592,-19809978368,51.2267,45.7162,45.6260,SOGAY


In [12]:
dfp

Unnamed: 0,area,perimeter,major_axis_length,minor_axis_length,convex_area,equiv_diameter,eccentricity,solidity,extent,roundness,aspect_ratio,compactness,class
0,56276,888.242,326.1485,220.2388,56831,267.6805,0.7376,0.9902,0.7453,0.8963,1.4809,0.8207,Çerçevelik
1,76631,1068.146,417.1932,234.2289,77280,312.3614,0.8275,0.9916,0.7151,0.8440,1.7811,0.7487,Çerçevelik
2,71623,1082.987,435.8328,211.0457,72663,301.9822,0.8749,0.9857,0.7400,0.7674,2.0651,0.6929,Çerçevelik
3,66458,992.051,381.5638,222.5322,67118,290.8899,0.8123,0.9902,0.7396,0.8486,1.7146,0.7624,Çerçevelik
4,66107,998.146,383.8883,220.4545,67117,290.1207,0.8187,0.9850,0.6752,0.8338,1.7413,0.7557,Çerçevelik
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2495,79637,1224.710,533.1513,190.4367,80381,318.4289,0.9340,0.9907,0.4888,0.6672,2.7996,0.5973,Ürgüp Sivrisi
2496,69647,1084.318,462.9416,191.8210,70216,297.7874,0.9101,0.9919,0.6002,0.7444,2.4134,0.6433,Ürgüp Sivrisi
2497,87994,1210.314,507.2200,222.1872,88702,334.7199,0.8990,0.9920,0.7643,0.7549,2.2828,0.6599,Ürgüp Sivrisi
2498,80011,1182.947,501.9065,204.7531,80902,319.1758,0.9130,0.9890,0.7374,0.7185,2.4513,0.6359,Ürgüp Sivrisi


# Merging

In [13]:
common_cols = sorted(set(dfp.columns.tolist()) & set(dfd.columns.tolist()))
common_cols

['area',
 'aspect_ratio',
 'class',
 'compactness',
 'convex_area',
 'eccentricity',
 'equiv_diameter',
 'extent',
 'major_axis_length',
 'minor_axis_length',
 'perimeter',
 'roundness',
 'solidity']

In [14]:
df = pd.concat([dfd[common_cols], dfp[common_cols]])
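`pd.concat` stacks the rows of the two DataFrames on top of each other. By default each DataFrame keeps its own row labels, so the combined `df` contains repeated index values. The sketch below shows the effect on two made-up frames (`dates` and `seeds` are hypothetical stand-ins, not the real data), along with the `ignore_index=True` option that renumbers the rows instead.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets.
dates = pd.DataFrame({"area": [400, 300], "class": ["BERHI", "SOGAY"]})
seeds = pd.DataFrame({"area": [50, 70], "class": ["Çerçevelik", "Ürgüp Sivrisi"]})

# Default behaviour: each frame keeps its own row labels, so 0 and 1 repeat.
stacked = pd.concat([dates, seeds])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([dates, seeds], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]

# With repeated labels, a label lookup returns every matching row.
print(stacked.loc[0])
```

Repeated labels are harmless for the steps that follow here, but they matter as soon as rows are looked up by label with `df.loc[]`.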

# Viewing subsets

In [15]:
# Note that a single column will be returned as a Series object
dfp["area"]

0       56276
1       76631
2       71623
3       66458
4       66107
        ...  
2495    79637
2496    69647
2497    87994
2498    80011
2499    84934
Name: area, Length: 2500, dtype: int64

In [16]:
# View just the area and perimeter columns of pumpkin seeds
# Selecting multiple columns returns a DataFrame object
dfp[["area", "perimeter"]]

Unnamed: 0,area,perimeter
0,56276,888.242
1,76631,1068.146
2,71623,1082.987
3,66458,992.051
4,66107,998.146
...,...,...
2495,79637,1224.710
2496,69647,1084.318
2497,87994,1210.314
2498,80011,1182.947


In [17]:
# View the area and perimeter of "BERHI" dates
dfd.loc[dfd["class"] == "BERHI", ["area", "perimeter"]]

Unnamed: 0,area,perimeter
0,422163,2378.9080
1,338136,2085.1440
2,526843,2647.3940
3,416063,2351.2100
4,347562,2160.3540
...,...,...
60,445290,2435.3889
61,375936,2255.8411
62,439650,2447.7371
63,433338,2423.3010


In [18]:
# Sort BERHI dates by perimeter in descending order
subdf = dfd.loc[dfd["class"] == "BERHI", ["area", "perimeter"]]
subdf.sort_values(by="perimeter", ascending=False)

Unnamed: 0,area,perimeter
8,546063,2714.9480
2,526843,2647.3940
64,500669,2580.3069
55,497802,2561.2910
14,467092,2514.2429
...,...,...
1,338136,2085.1440
31,314313,2040.2850
23,306560,2015.1980
59,294273,2004.9600
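The filtering and sorting cells above rely on boolean masks. As a minimal sketch of how such a mask is built and combined, here is a toy frame with invented values; the names `toy`, `is_berhi`, and `big_berhi` are illustrative only.

```python
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame(
    {
        "class": ["BERHI", "BERHI", "SOGAY", "SOGAY"],
        "area": [400, 300, 250, 360],
        "perimeter": [2300.0, 2000.0, 1900.0, 2600.0],
    }
)

# A comparison on a column yields a boolean Series with one entry per row.
is_berhi = toy["class"] == "BERHI"

# Combine masks with & (and) or | (or); each comparison needs its own parentheses.
big_berhi = (toy["class"] == "BERHI") & (toy["area"] > 350)

# .loc[rows, columns]: the mask selects rows, the list selects columns.
subset = toy.loc[big_berhi, ["area", "perimeter"]]

# sort_values returns a new DataFrame; the original row order is left untouched.
by_perimeter = toy.sort_values(by="perimeter", ascending=False)

print(subset)
print(by_perimeter)
```

Note that the sorted result keeps the original row labels, which is why the index in the sorted BERHI table above is no longer in ascending order.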


# Normalization

In [25]:
df

Unnamed: 0,area,aspect_ratio,class,compactness,convex_area,eccentricity,equiv_diameter,extent,major_axis_length,minor_axis_length,perimeter,roundness,solidity
0,422163,1.2976,BERHI,0.8750,424428,0.6373,733.1539,0.7831,837.8484,645.6693,2378.908,0.9374,0.9947
1,338136,1.2161,BERHI,0.9065,339014,0.5690,656.1464,0.7795,723.8198,595.2073,2085.144,0.9773,0.9974
2,526843,1.3150,BERHI,0.8706,528876,0.6494,819.0222,0.7657,940.7379,715.3638,2647.394,0.9446,0.9962
3,416063,1.2831,BERHI,0.8791,418255,0.6266,727.8378,0.7759,827.9804,645.2988,2351.210,0.9458,0.9948
4,347562,1.3108,BERHI,0.8707,350797,0.6465,665.2291,0.7569,763.9877,582.8359,2160.354,0.9358,0.9908
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2495,79637,2.7996,Ürgüp Sivrisi,0.5973,80381,0.9340,318.4289,0.4888,533.1513,190.4367,1224.710,0.6672,0.9907
2496,69647,2.4134,Ürgüp Sivrisi,0.6433,70216,0.9101,297.7874,0.6002,462.9416,191.8210,1084.318,0.7444,0.9919
2497,87994,2.2828,Ürgüp Sivrisi,0.6599,88702,0.8990,334.7199,0.7643,507.2200,222.1872,1210.314,0.7549,0.9920
2498,80011,2.4513,Ürgüp Sivrisi,0.6359,80902,0.9130,319.1758,0.7374,501.9065,204.7531,1182.947,0.7185,0.9890


In [26]:
feature_cols = df.columns.drop("class")
X_df = df[feature_cols]

In [27]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")

X_df_trans = scaler.fit_transform(X_df)

In [28]:
X_df_trans

Unnamed: 0,area,aspect_ratio,compactness,convex_area,eccentricity,equiv_diameter,extent,major_axis_length,minor_axis_length,perimeter,roundness,solidity
0,2.552052,-0.083811,2.000216,2.504175,-2.361406,2.385535,1.265079,1.933669,2.595522,2.137744,1.918729,0.697721
1,1.796949,-0.092708,2.438857,1.751465,-3.205979,1.842734,1.207074,1.207154,2.219696,1.511971,2.515082,0.958265
2,3.492753,-0.081912,1.938945,3.424622,-2.211781,2.990793,0.984721,2.589213,3.114585,2.709670,2.026341,0.842467
3,2.497235,-0.085394,2.057309,2.449776,-2.493718,2.348064,1.149069,1.870796,2.592762,2.078742,2.044277,0.707371
4,1.881655,-0.082370,1.940338,1.855303,-2.247642,1.906755,0.842931,1.463077,2.127558,1.672183,1.894815,0.321379
...,...,...,...,...,...,...,...,...,...,...,...,...
2495,-0.526038,0.080159,-1.866788,-0.527735,1.307481,-0.537726,-3.476842,-0.007661,-0.794911,-0.320915,-2.119733,0.311729
2496,-0.615812,0.037998,-1.226233,-0.617314,1.011942,-0.683221,-1.681905,-0.454991,-0.784601,-0.619977,-0.965887,0.427527
2497,-0.450938,0.023741,-0.995076,-0.454406,0.874683,-0.422896,0.962164,-0.172878,-0.558443,-0.351581,-0.808952,0.437177
2498,-0.522677,0.042136,-1.329279,-0.523144,1.047802,-0.532462,0.528736,-0.206732,-0.688287,-0.409878,-1.352993,0.147683
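`StandardScaler` standardizes each column by subtracting its mean and dividing by its standard deviation, so every feature ends up with mean 0 and unit variance. The same arithmetic can be sketched with pandas alone on a toy frame (the values below are made up). The one detail to watch is `ddof=0`: `StandardScaler` uses the population standard deviation, whereas `DataFrame.std` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd

# Toy features with invented values.
toy = pd.DataFrame(
    {"area": [1.0, 2.0, 3.0, 4.0], "perimeter": [10.0, 10.0, 30.0, 30.0]}
)

# Standardize each column: subtract its mean, divide by its standard deviation.
# ddof=0 selects the population standard deviation, matching StandardScaler.
z = (toy - toy.mean(axis=0)) / toy.std(axis=0, ddof=0)

# Each standardized column has mean 0 and standard deviation 1 (up to rounding).
print(np.allclose(z.mean(axis=0), 0.0))
print(np.allclose(z.std(axis=0, ddof=0), 1.0))
```

Because each column is rescaled independently, features measured on very different scales, such as `area` and `solidity`, become directly comparable.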
