# Data import of polymer data

Data and methodology are taken from: *Estimation and Prediction of the Polymers’ Physical Characteristics Using the Machine Learning Models*, Polymers 2024, 16(1), 115; https://doi.org/10.3390/polym16010115.

GitHub repository: https://github.com/catauggie/polymersML/tree/main

In [1]:
# Import the pandas library
import pandas as pd

# Use pandas to import the polymer datafile into a dataframe
polyinfo = pd.read_excel('polyinfo homopolymer.xlsx')

In [19]:
# Print the first rows of the dataframe to check that the file loaded correctly
polyinfo.head()

Unnamed: 0,polymer_name,polymer_id,cu_formula,property_category,property_name,property_value_median,property_value_variance,property_unit
0,polyethene,P010001,CH2,Physical property,Density,0.9362,0.010564,g/cm3
1,polyethene,P010001,CH2,Physical property,Specific volume,1.073,12.574191,cm3/g
2,polyethene,P010001,CH2,Optical property,Refractive index,1.531,0.00091,
3,polyethene,P010001,CH2,Thermal property,Crystallization kinetics r,79175.0,122911964.285714,nm/s
4,polyethene,P010001,CH2,Thermal property,Crystallization kinetics k,0.0617,0.58551,


In [18]:
# Determine the number of unique polymers in the dataset
polymers = list(polyinfo['polymer_name'].unique())
print(f'Number of polymers in the dataset = {len(polymers):,}')

Number of polymers in the dataset = 18,312
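
The count above is based on `polymer_name`. As a hedged cross-check, the sketch below compares the number of distinct names with the number of distinct `polymer_id` values and looks for names attached to more than one ID. The `toy` frame and its placeholder names and IDs are assumptions standing in for `polyinfo`; they are not rows from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for `polyinfo` (placeholder names and IDs)
toy = pd.DataFrame({
    'polymer_name': ['polymer A', 'polymer A', 'polymer B'],
    'polymer_id': ['ID-A', 'ID-A', 'ID-B'],
    'property_name': ['Density', 'Refractive index', 'Density'],
})

# nunique() skips missing values, so it matches len(unique())
# only when no polymer name is missing
n_names = toy['polymer_name'].nunique()
n_ids = toy['polymer_id'].nunique()

# Number of distinct IDs attached to each name; a value above 1 would mean
# that filtering on the name alone merges several polymers
ids_per_name = toy.groupby('polymer_name')['polymer_id'].nunique()
shared_names = ids_per_name[ids_per_name > 1]

print(f'names = {n_names:,}, ids = {n_ids:,}, shared names = {len(shared_names):,}')
```

If the two counts differed on the real data, grouping on `polymer_id` instead of `polymer_name` might be the safer key for the steps that follow.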


## Slice the data into smaller dataframes
The code in the next cell iterates over the list of polymer names, filters the `polyinfo` DataFrame for each name, and appends the filtered rows to a list named `poly_list`. Each part of the code does the following:

1. `poly_list = []`: Initializes `poly_list` as an empty list. This list will store the filtered information for each polymer found in the `polymers` list.

2. `for p in range(len(polymers)):`: This line starts a loop that iterates over the indices of the `polymers` list. The `range(len(polymers))` generates a sequence of numbers from `0` to the length of the `polymers` list minus one, effectively iterating over each index in the `polymers` list.

3. `poly = polyinfo[polyinfo['polymer_name'] == polymers[p]]`: For each iteration of the loop, this line filters the `polyinfo` DataFrame for rows where the value in the 'polymer_name' column matches the current polymer name in the loop (`polymers[p]`). The result of this filtering (which could be one or more rows of data) is assigned to the variable `poly`.

4. `poly_list.append(poly)`: This line appends the filtered DataFrame `poly` to the `poly_list`. After the loop completes, `poly_list` will contain a list of DataFrames, where each DataFrame corresponds to a filtered view of `polyinfo` for each polymer name in the `polymers` list.

In summary, this code filters a large DataFrame (`polyinfo`) for the rows matching each polymer name in a list (`polymers`), and collects these filtered DataFrames into a list (`poly_list`). This separates the data specific to each polymer for use in subsequent operations.

In [21]:
poly_list = []
for p in range(len(polymers)):
    poly = polyinfo[polyinfo['polymer_name'] == polymers[p]]
    poly_list.append(poly)
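
The loop above scans the full DataFrame once per polymer name, which can be slow with many thousands of names. A possible alternative, sketched here on a small hypothetical frame (`toy` is an assumption, not the real data), is to let `groupby` do the split in a single pass. Passing `sort=False` keeps the groups in order of first appearance, which is the same order that `unique()` returns.

```python
import pandas as pd

# Hypothetical stand-in for `polyinfo`
toy = pd.DataFrame({
    'polymer_name': ['polymer B', 'polymer A', 'polymer B', 'polymer A', 'polymer C'],
    'property_name': ['Density', 'Density', 'Refractive index',
                      'Specific volume', 'Density'],
})

# Approach used in the notebook: one boolean filter per polymer name
names = list(toy['polymer_name'].unique())
loop_list = [toy[toy['polymer_name'] == name] for name in names]

# groupby alternative: each group keeps its original rows and index
group_list = [group for _, group in toy.groupby('polymer_name', sort=False)]

# Check that both approaches produce the same DataFrames in the same order
same = len(loop_list) == len(group_list) and all(
    a.equals(b) for a, b in zip(loop_list, group_list)
)
print(same)
```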

In [24]:
# Show the dataframe in the list associated with the first entry (polyethene)
poly_list[0]

Unnamed: 0,polymer_name,polymer_id,cu_formula,property_category,property_name,property_value_median,property_value_variance,property_unit
0,polyethene,P010001,CH2,Physical property,Density,0.9362,0.010564,g/cm3
1,polyethene,P010001,CH2,Physical property,Specific volume,1.073,12.574191,cm3/g
2,polyethene,P010001,CH2,Optical property,Refractive index,1.531,0.00091,
3,polyethene,P010001,CH2,Thermal property,Crystallization kinetics r,79175,122911964.285714,nm/s
4,polyethene,P010001,CH2,Thermal property,Crystallization kinetics k,0.0617,0.58551,
...,...,...,...,...,...,...,...,...
88,polyethene,P010001,CH2,Other physical property,Pvt relation pressure,55,42001.909205,MPa
89,polyethene,P010001,CH2,Other physical property,Pvt relation specific volume,1.127,0.0146,cm3/g
90,polyethene,P010001,CH2,Other physical property,Pvt relation temperature,180,4496404.537488,C
91,polyethene,P010001,CH2,Other physical property,Radiation resistance,26,39970.808057,Mrad


Using the same approach as for splitting the larger DataFrame, count the number of properties recorded for each polymer in `poly_list` (a list of DataFrames) and append each count to `len_list`.

In [27]:
len_list = []
for pp in range(len(poly_list)):
    len_pp = len(poly_list[pp])
    len_list.append(len_pp)

Use the same iteration pattern to build a list of the polymer names present in `poly_list`.

In [28]:
poly_names_list = []
for n in range(len(poly_list)):
    name_n = list(poly_list[n]['polymer_name'])[0]
    poly_names_list.append(name_n)
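
The two index-based loops above can also be written as list comprehensions that iterate over the DataFrames directly. The sketch below uses a hypothetical two-element `poly_list_demo` in place of the real `poly_list`, and `.iloc[0]` to read the first name without converting the whole column to a list.

```python
import pandas as pd

# Hypothetical stand-in for `poly_list`: one small DataFrame per polymer
poly_list_demo = [
    pd.DataFrame({'polymer_name': ['polymer A'] * 3,
                  'property_name': ['Density', 'Specific volume', 'Refractive index']}),
    pd.DataFrame({'polymer_name': ['polymer B'] * 2,
                  'property_name': ['Density', 'Specific volume']}),
]

# Number of rows (recorded properties) per polymer
len_list_demo = [len(df) for df in poly_list_demo]

# First value of the name column in each DataFrame
poly_names_demo = [df['polymer_name'].iloc[0] for df in poly_list_demo]

print(len_list_demo, poly_names_demo)
```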

In [33]:
print(f'{len(poly_names_list):,}')

18,312


Extract the list of properties available for each polymer in `poly_list`.


In [34]:
poly_feats_list = []
for f in range(len(poly_list)):
    feat_n = list(poly_list[f]['property_name'])
    poly_feats_list.append(feat_n)
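
One way such a list of lists might be used, shown here as a sketch with hypothetical names and property lists, is to look up which polymers report a given property by pairing each name with its feature list.

```python
# Hypothetical stand-ins for `poly_names_list` and `poly_feats_list`
poly_names_demo = ['polymer A', 'polymer B', 'polymer C']
poly_feats_demo = [
    ['Density', 'Specific volume', 'Refractive index'],
    ['Density'],
    ['Specific volume', 'Refractive index'],
]

# Keep the names whose feature list contains the property of interest
target = 'Refractive index'
has_target = [
    name
    for name, feats in zip(poly_names_demo, poly_feats_demo)
    if target in feats
]
print(has_target)
```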

From these three lists we build a dictionary that aggregates information about the polymers: their names, the number of features (properties), and the names of those features.


* **'polymer_name': poly_names_list:** Associates the key **'polymer_name'** with the list **poly_names_list**, which contains the names of polymers. This list was prepared by iterating over a list of DataFrames (`poly_list`), extracting the name of the polymer from each DataFrame, and storing these names.
* **'Number of features': len_list:** Links the key **'Number of features'** with the list **len_list**, which records the number of features for each polymer. This list was filled by iterating over `poly_list`, determining the length of the DataFrame (or specifically, the number of rows in the **'property_name'** column) for each polymer, and adding these lengths to **len_list**.
* **'Features names': poly_feats_list:** Connects the key **'Features names'** with the list **poly_feats_list**, containing lists of feature names for each polymer. This list of lists was compiled by iterating over `poly_list`, extracting all the feature names (values from the **'property_name'** column) for each polymer, and appending these lists to **poly_feats_list**.

In [36]:
df_poly_feats = {'polymer_name': poly_names_list,
                 'Number of features': len_list,
                 'Features names': poly_feats_list}

This dictionary maps column names to equal-length lists, so it can be passed directly to `pd.DataFrame`.
Building a structured DataFrame from the `df_poly_feats` dictionary, rather than working with `poly_list` directly, makes the data easier to analyze, present, and manipulate.

In [40]:
ddf = pd.DataFrame(df_poly_feats)
ddf.head()

Unnamed: 0,polymer_name,Number of features,Features names
0,polyethene,93,"[Density, Specific volume, Refractive index, C..."
1,poly(prop-1-ene),86,"[Density, Specific volume, Refractive index, C..."
2,poly(but-1-ene),41,"[Density, Specific volume, Refractive index, C..."
3,poly(pent-1-ene),16,"[Density, Specific volume, Glass transition te..."
4,poly(3-methylbut-1-ene),15,"[Density, Specific volume, Glass transition te..."
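
A summary table of this shape could also be built straight from the long-format data, without the intermediate lists. The sketch below is one possible approach on a hypothetical `toy` frame (an assumption, not the real `polyinfo`): it groups the `property_name` column by polymer and combines the group sizes with the per-group lists.

```python
import pandas as pd

# Hypothetical stand-in for `polyinfo`
toy = pd.DataFrame({
    'polymer_name': ['polymer A', 'polymer A', 'polymer B', 'polymer A', 'polymer B'],
    'property_name': ['Density', 'Specific volume', 'Density',
                      'Refractive index', 'Specific volume'],
})

# Group the property names by polymer, keeping first-appearance order
grouped = toy.groupby('polymer_name', sort=False)['property_name']

# size() counts rows per polymer; agg(list) collects the property names
summary = pd.DataFrame({
    'Number of features': grouped.size(),
    'Features names': grouped.agg(list),
}).reset_index()

print(summary)
```

Note that `Number of features` counts rows, so a property recorded twice for the same polymer would be counted twice.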


#### Advantages of structuring the data in this way
1. **Structured Representation:**
   * *Clarity and Accessibility:* A DataFrame provides a tabular structure that is intuitive to work with. It clearly delineates polymer names, the number of features, and the features themselves in separate columns, making data access and manipulation straightforward.
   * *Integration of Related Data:* By consolidating polymer names, the number of features, and feature names into a single DataFrame, you establish a direct relationship among these elements. This integrated view facilitates analyses that involve multiple aspects of the data simultaneously.
2. **Enhanced Data Analysis Capabilities:**
   * *Built-in Functions:* pandas DataFrames support a wide array of built-in functions for data analysis, including statistical summaries, group-by operations, and pivot tables, which are not as readily accessible or efficient with a list of DataFrames or lists.
   * *Filtering and Selection:* Identifying and extracting data based on certain criteria (e.g., polymers with a specific number of features) is more straightforward with DataFrame operations.
3. **Ease of Data Manipulation:**
   * *Modification:* Adding, removing, or modifying data (such as adding another feature or updating feature names) is more efficient in a DataFrame structure. Changes can be applied across multiple rows or columns with simple commands.
   * *Reshaping and Pivoting:* Transforming the dataset to fit specific analysis needs (like pivoting or melting data for different views) is facilitated by DataFrame methods.
4. **Improved Data Visualization:**
   * *Direct Plotting:* pandas DataFrames integrate seamlessly with visualization libraries like matplotlib and seaborn, allowing for direct plotting of data. Visualizing the distribution of features across polymers or comparing the number of features between polymers can be achieved with minimal code.
5. **Data Export and Sharing:**
   * *Interoperability:* DataFrames can be easily exported to various formats (CSV, Excel, JSON) for sharing or further analysis, offering flexibility not inherently available with a custom list structure like `poly_list`.
6. **Error Reduction:**
   * *Consistency:* By structuring data into a DataFrame, you ensure a consistent format, which can reduce errors in data handling and analysis. The tabular format enforces a uniform structure, potentially catching inconsistencies in data types or missing values.
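
A few of these points can be illustrated with a short sketch. The `ddf_demo` frame below is a hypothetical stand-in with the same three columns as `ddf`; the threshold of two features and the in-memory CSV buffer are illustrative assumptions.

```python
import io

import pandas as pd

# Hypothetical stand-in for `ddf`
ddf_demo = pd.DataFrame({
    'polymer_name': ['polymer A', 'polymer B', 'polymer C'],
    'Number of features': [3, 1, 2],
    'Features names': [
        ['Density', 'Specific volume', 'Refractive index'],
        ['Density'],
        ['Specific volume', 'Refractive index'],
    ],
})

# Filtering: polymers with at least two recorded properties
well_described = ddf_demo[ddf_demo['Number of features'] >= 2]

# Reshaping: one row per (polymer, property) pair,
# then count how many polymers report each property
long_form = ddf_demo.explode('Features names')
coverage = long_form['Features names'].value_counts()

# Export: written to an in-memory buffer here; a file path works the same way
buffer = io.StringIO()
well_described.to_csv(buffer, index=False)

print(well_described['polymer_name'].tolist())
print(coverage)
```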