# Data import of polymer data

Data and methodology taken from: "Estimation and Prediction of the Polymers’ Physical Characteristics Using the Machine Learning Models", *Polymers* 2024, 16(1), 115; https://doi.org/10.3390/polym16010115.

Github repository: https://github.com/catauggie/polymersML/tree/main

In [1]:
#import the pandas library
import pandas as pd

# Use pandas to import the polymer datafile into a dataframe
polyinfo = pd.read_excel('polyinfo homopolymer.xlsx')

In [19]:
#Print out the head of the dataframe to see if it looks ok
polyinfo.head()

Unnamed: 0,polymer_name,polymer_id,cu_formula,property_category,property_name,property_value_median,property_value_variance,property_unit
0,polyethene,P010001,CH2,Physical property,Density,0.9362,0.010564,g/cm3
1,polyethene,P010001,CH2,Physical property,Specific volume,1.073,12.574191,cm3/g
2,polyethene,P010001,CH2,Optical property,Refractive index,1.531,0.00091,
3,polyethene,P010001,CH2,Thermal property,Crystallization kinetics r,79175.0,122911964.285714,nm/s
4,polyethene,P010001,CH2,Thermal property,Crystallization kinetics k,0.0617,0.58551,


In [18]:
# Determine the number of unique polymers in the dataset
polymers = list(polyinfo['polymer_name'].unique())
print(f'Number of polymers in the dataset = {len(polymers):,}')

Number of polymers in the dataset = 18,312


## Slice the data into smaller dataframes
The code below iterates through the list of polymers, filters the `polyinfo` DataFrame for each one, and appends the filtered rows to a list named `poly_list`. Here's a breakdown of what each part of the code does:

1. `poly_list = []`: Initializes `poly_list` as an empty list. This list will store the filtered information for each polymer found in the `polymers` list.

2. `for p in range(len(polymers)):`: This line starts a loop that iterates over the indices of the `polymers` list. The `range(len(polymers))` generates a sequence of numbers from `0` to the length of the `polymers` list minus one, effectively iterating over each index in the `polymers` list.

3. `poly = polyinfo[polyinfo['polymer_name'] == polymers[p]]`: For each iteration of the loop, this line filters the `polyinfo` DataFrame for rows where the value in the 'polymer_name' column matches the current polymer name in the loop (`polymers[p]`). The result of this filtering (which could be one or more rows of data) is assigned to the variable `poly`.

4. `poly_list.append(poly)`: This line appends the filtered DataFrame `poly` to the `poly_list`. After the loop completes, `poly_list` will contain a list of DataFrames, where each DataFrame corresponds to a filtered view of `polyinfo` for each polymer name in the `polymers` list.

In summary, this code filters a large DataFrame (`polyinfo`) for the rows matching each polymer name in a list (`polymers`), and collects these filtered DataFrames into a list (`poly_list`). This separates the data specific to each polymer for use in subsequent operations.

In [21]:
poly_list = []
for p in range(len(polymers)):
    poly = polyinfo[polyinfo['polymer_name'] == polymers[p]]
    poly_list.append(poly)
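
The same per-polymer split can also be sketched with `groupby`, which avoids re-scanning the whole DataFrame once per polymer. The example below uses a small made-up table (`toy`, with illustrative values) rather than the real `polyinfo` data; `split_loop` and `split_groupby` are hypothetical names. Passing `sort=False` keeps the groups in order of first appearance, which matches the order returned by `unique()`.

```python
import pandas as pd

# Toy stand-in for polyinfo (values are illustrative only)
toy = pd.DataFrame({
    'polymer_name': ['polyethene', 'polyethene', 'poly(prop-1-ene)'],
    'property_name': ['Density', 'Refractive index', 'Density'],
    'property_value_median': [0.9362, 1.531, 0.9],
})

# Loop-based split, following the same logic as the cell above
names = list(toy['polymer_name'].unique())
split_loop = [toy[toy['polymer_name'] == name] for name in names]

# groupby-based split; sort=False preserves first-appearance order
split_groupby = [group for _, group in toy.groupby('polymer_name', sort=False)]

print(len(split_loop), len(split_groupby))
```

On a dataset with many thousands of polymers the `groupby` form is likely to be noticeably faster, because the loop builds a full boolean mask over every row for each polymer name.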

In [24]:
# Show the dataframe in the list associated with the first entry (polyethene)
poly_list[0]

Unnamed: 0,polymer_name,polymer_id,cu_formula,property_category,property_name,property_value_median,property_value_variance,property_unit
0,polyethene,P010001,CH2,Physical property,Density,0.9362,0.010564,g/cm3
1,polyethene,P010001,CH2,Physical property,Specific volume,1.073,12.574191,cm3/g
2,polyethene,P010001,CH2,Optical property,Refractive index,1.531,0.00091,
3,polyethene,P010001,CH2,Thermal property,Crystallization kinetics r,79175,122911964.285714,nm/s
4,polyethene,P010001,CH2,Thermal property,Crystallization kinetics k,0.0617,0.58551,
...,...,...,...,...,...,...,...,...
88,polyethene,P010001,CH2,Other physical property,Pvt relation pressure,55,42001.909205,MPa
89,polyethene,P010001,CH2,Other physical property,Pvt relation specific volume,1.127,0.0146,cm3/g
90,polyethene,P010001,CH2,Other physical property,Pvt relation temperature,180,4496404.537488,C
91,polyethene,P010001,CH2,Other physical property,Radiation resistance,26,39970.808057,Mrad


Use a loop similar to the one that split the larger DataFrame to count the number of properties recorded for each polymer in `poly_list`, appending each count to `len_list`.

In [27]:
len_list = []
for pp in range(len(poly_list)):
    len_pp = len(poly_list[pp])
    len_list.append(len_pp)

Use the same iteration approach to generate a list of the polymer names present in `poly_list`.

In [28]:
poly_names_list = []
for n in range(len(poly_list)):
    name_n = list(poly_list[n]['polymer_name'])[0]
    poly_names_list.append(name_n)

In [33]:
print(f'{len(poly_names_list):,}')

18,312


Extract a list of the properties available for each polymer in the list `poly_list`.


In [34]:
poly_feats_list = []
for f in range(len(poly_list)):
    feat_n = list(poly_list[f]['property_name'])
    poly_feats_list.append(feat_n)
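
The three loops above (counts, names, property lists) can also be expressed as a single grouped aggregation. The following is a minimal sketch on a made-up table (`toy`), not the notebook's own method; `grouped` and `summary` are hypothetical names, and the output column labels are chosen here only to mirror the ones used in this notebook.

```python
import pandas as pd

# Toy stand-in for polyinfo (illustrative rows only)
toy = pd.DataFrame({
    'polymer_name': ['polyethene', 'polyethene', 'poly(prop-1-ene)'],
    'property_name': ['Density', 'Refractive index', 'Density'],
})

# Group the property names by polymer, keeping first-appearance order
grouped = toy.groupby('polymer_name', sort=False)['property_name']

# One row per polymer: how many properties it has, and which ones
summary = pd.DataFrame({
    'Number of features': grouped.size(),
    'Features names': grouped.agg(list),
}).reset_index()

print(summary)
```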

From these three lists we build a dictionary that aggregates information about the polymers: their names, the number of features (properties), and the names of these features.


* **'polymer_name': poly_names_list:** Associates the key **'polymer_name'** with the list **poly_names_list**, which contains the names of polymers. This list was prepared by iterating over a list of DataFrames (`poly_list`), extracting the name of the polymer from each DataFrame, and storing these names.
* **'Number of features': len_list:** Links the key **'Number of features'** with the list **len_list**, which records the number of features for each polymer. This list was filled by iterating over `poly_list`, determining the length of the DataFrame (or specifically, the number of rows in the **'property_name'** column) for each polymer, and adding these lengths to **len_list**.
* **'Features names': poly_feats_list:** Connects the key **'Features names'** with the list **poly_feats_list**, containing lists of feature names for each polymer. This list of lists was compiled by iterating over `poly_list`, extracting all the feature names (values from the **'property_name'** column) for each polymer, and appending these lists to **poly_feats_list**.

In [36]:
df_poly_feats = {'polymer_name': poly_names_list,
                 'Number of features': len_list,
                 'Features names': poly_feats_list}

This dictionary is structured so that it converts directly into a pandas DataFrame (`ddf` below).
Building a structured DataFrame from `poly_list` and the associated lists, rather than using `poly_list` directly, makes the data easier to analyse, present, and manipulate.

In [40]:
ddf = pd.DataFrame(df_poly_feats)
ddf.head()

Unnamed: 0,polymer_name,Number of features,Features names
0,polyethene,93,"[Density, Specific volume, Refractive index, C..."
1,poly(prop-1-ene),86,"[Density, Specific volume, Refractive index, C..."
2,poly(but-1-ene),41,"[Density, Specific volume, Refractive index, C..."
3,poly(pent-1-ene),16,"[Density, Specific volume, Glass transition te..."
4,poly(3-methylbut-1-ene),15,"[Density, Specific volume, Glass transition te..."


#### The advantage of structuring the data in this way:
1. **Structured Representation:**
    * Clarity and Accessibility: A DataFrame provides a tabular structure that is intuitive to work with. It clearly delineates polymer names, the number of features, and the features themselves in separate columns, making data access and manipulation straightforward.
    * Integration of Related Data: By consolidating polymer names, the number of features, and feature names into a single DataFrame, you establish a direct relationship among these elements. This integrated view facilitates analyses that involve multiple aspects of the data simultaneously.
2. **Enhanced Data Analysis Capabilities:**
    * Built-in Functions: pandas DataFrames support a wide array of built-in functions for data analysis, including statistical summaries, group-by operations, and pivot tables, which are not as readily accessible or efficient with a list of DataFrames or lists.
    * Filtering and Selection: Identifying and extracting data based on certain criteria (e.g., polymers with a specific number of features) is more straightforward with DataFrame operations.
3. **Ease of Data Manipulation:**
    * Modification: Adding, removing, or modifying data (such as adding another feature or updating feature names) is more efficient in a DataFrame structure. Changes can be applied across multiple rows or columns with simple commands.
    * Reshaping and Pivoting: Transforming the dataset to fit specific analysis needs (like pivoting or melting data for different views) is facilitated by DataFrame methods.
4. **Improved Data Visualization:**
    * Direct Plotting: pandas DataFrames integrate seamlessly with visualization libraries like matplotlib and seaborn, allowing for direct plotting of data. Visualizing the distribution of features across polymers or comparing the number of features between polymers can be achieved with minimal code.
5. **Data Export and Sharing:**
    * Interoperability: DataFrames can be easily exported to various formats (CSV, Excel, JSON) for sharing or further analysis, offering flexibility not inherently available with a custom list structure like `poly_list`.
6. **Error Reduction:**
    * Consistency: By structuring data into a DataFrame, you ensure a consistent format, which can reduce errors in data handling and analysis. The tabular format enforces a uniform structure, potentially catching inconsistencies in data types or missing values.
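
As a small illustration of the filtering and summary points in the list above, the sketch below works on a toy table (`toy_ddf`) whose four rows are copied from the `ddf.head()` output. The threshold of 40 and the names `well_characterised` and `ranked` are arbitrary choices made for this example.

```python
import pandas as pd

# Toy summary table; rows copied from the ddf.head() output above
toy_ddf = pd.DataFrame({
    'polymer_name': ['polyethene', 'poly(prop-1-ene)',
                     'poly(but-1-ene)', 'poly(pent-1-ene)'],
    'Number of features': [93, 86, 41, 16],
})

# Keep polymers with at least 40 recorded properties (arbitrary threshold)
well_characterised = toy_ddf[toy_ddf['Number of features'] >= 40]

# Rank polymers by how many properties are recorded
ranked = toy_ddf.sort_values('Number of features', ascending=False)

print(well_characterised['polymer_name'].tolist())

# Built-in statistical summary of the counts
print(toy_ddf['Number of features'].describe())
```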

In [42]:
# We export this data as a convenient checkpoint, so that it is handy when needed
ddf.to_excel("polymers_number_of_features.xlsx", index=True)

In [44]:
# And import it back just because we can and check the first rows 
df = pd.read_excel('polymers_number_of_features.xlsx')
df.head()

Unnamed: 0.1,Unnamed: 0,polymer_name,Number of features,Features names
0,0,polyethene,93,"['Density', 'Specific volume', 'Refractive ind..."
1,1,poly(prop-1-ene),86,"['Density', 'Specific volume', 'Refractive ind..."
2,2,poly(but-1-ene),41,"['Density', 'Specific volume', 'Refractive ind..."
3,3,poly(pent-1-ene),16,"['Density', 'Specific volume', 'Glass transiti..."
4,4,poly(3-methylbut-1-ene),15,"['Density', 'Specific volume', 'Glass transiti..."
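
One thing to watch for after the round trip through Excel: each `Features names` cell is written out as text, so in the re-imported table it comes back as a string rather than a Python list. If the list form is needed again, one option is to parse the strings with `ast.literal_eval` from the standard library. The sketch below uses a one-row toy frame (`toy_df`) that imitates the re-imported data instead of reading the actual file.

```python
import ast
import pandas as pd

# Toy frame imitating the re-imported file: the list is stored as text
toy_df = pd.DataFrame({
    'polymer_name': ['poly(hex-1-ene)'],
    'Features names': ["['Density', 'Specific volume', 'Melting temperature']"],
})

# Before parsing, the cell is a str, so indexing would return characters
print(type(toy_df['Features names'][0]))

# Parse each string back into a real Python list
toy_df['Features names'] = toy_df['Features names'].apply(ast.literal_eval)

# After parsing, indexing returns whole feature names again
print(toy_df['Features names'][0][0])
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.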


In [45]:
# Do some checking to ensure the df has the same data as in poly_list
df['Features names'][5]

"['Density', 'Specific volume', 'Glass transition temperature', 'Melting temperature', 'Specific heat capacity cp', 'Specific heat capacity cv', 'Gas permeability coefficient p', 'Intrinsic viscosity eta', 'Radius of gyration', 'Softening temperature', 'SMILES']"

Output the sixth DataFrame in `poly_list` (index 5) so that we can check our methods of manipulation below.

In [46]:
poly_list[5]

Unnamed: 0,polymer_name,polymer_id,cu_formula,property_category,property_name,property_value_median,property_value_variance,property_unit
244,poly(hex-1-ene),P010007,C6H12,Physical property,Density,0.854,0.003341,g/cm3
245,poly(hex-1-ene),P010007,C6H12,Physical property,Specific volume,1.182,0.009742,cm3/g
246,poly(hex-1-ene),P010007,C6H12,Thermal property,Glass transition temperature,-48.3,61.255282,C
247,poly(hex-1-ene),P010007,C6H12,Thermal property,Melting temperature,30,7168.266667,C
248,poly(hex-1-ene),P010007,C6H12,Thermal property,Specific heat capacity cp,0.4586,0.017784,cal/(g*C)
249,poly(hex-1-ene),P010007,C6H12,Thermal property,Specific heat capacity cv,0.291,0.0,cal/(g*C)
250,poly(hex-1-ene),P010007,C6H12,Physicochemical property,Gas permeability coefficient p,0.0,0.0,cm3(STP)cm/(cm2*s*Pa)
251,poly(hex-1-ene),P010007,C6H12,Dilute solution property,Intrinsic viscosity eta,0.29,2.13798,dl/g
252,poly(hex-1-ene),P010007,C6H12,Dilute solution property,Radius of gyration,9.83,0.007707,nm
253,poly(hex-1-ene),P010007,C6H12,Heat characteristics,Softening temperature,-36,0.0,C


Access the first feature name in the list stored in the `Features names` column of the sixth row of `ddf` and append `'_cfdf'` to it. The expression returns the concatenated string.

In [51]:
# Check that we can append a suffix to a feature name taken from the dataframe
ddf['Features names'][5][0] +'_' +'cfdf'

'Specific heat capacity cv_cfdf'

Access a specific DataFrame within the list of DataFrames (`poly_list`) and filter it on the `property_name` column. Specifically, filter the sixth DataFrame in `poly_list` (Python uses zero-based indexing, so `poly_list[5]` is the sixth element) for rows where `property_name` matches the fifth feature name in the list stored in the sixth row of the `Features names` column of `ddf`.

Here's a breakdown:

* `poly_list[5]`: This accesses the sixth DataFrame within `poly_list`.
* `poly_list[5]['property_name']`: This selects the `property_name` column of the sixth DataFrame.
* `ddf['Features names'][5][4]`: This retrieves the fifth feature name from the list of feature names in the sixth row of the `Features names` column of `ddf`. *Remember, both indices are zero-based, so this expression points to what would commonly be referred to as the fifth feature name of the sixth polymer.*

The complete expression filters the sixth DataFrame in `poly_list` to include only those rows where `property_name` matches the specified feature name from `ddf`.

In [49]:
#Check that we can nicely access certain parts of the dataframe
poly_list[5][poly_list[5]['property_name'] == ddf['Features names'][5][4]]

Unnamed: 0,polymer_name,polymer_id,cu_formula,property_category,property_name,property_value_median,property_value_variance,property_unit
248,poly(hex-1-ene),P010007,C6H12,Thermal property,Specific heat capacity cp,0.4586,0.017784,cal/(g*C)


The next code block iterates over the feature names listed for the sixth polymer, `ddf['Features names'][5]`, and extracts the median, variance, and unit of each feature from the corresponding sixth DataFrame in `poly_list`. It collects these values into separate lists, together with lists of the matching names.

Here's a detailed explanation of each part:

* Initialization of Lists: Six lists are initialized to store the median values, variance values, and units for the features, along with their corresponding names. These lists will be populated with the extracted data for each feature of the polymer.
* Looping Over Feature Names: The loop iterates over each feature name for the sixth polymer in ddf.
* Extracting Median Values: For each feature, the code filters the sixth DataFrame in `poly_list` to rows where the `property_name` matches the current feature name being processed. It then selects the `property_value_median` column, converts it to a list, and takes the first element. This value is the median value of the property. A corresponding name for this median value is constructed by appending `_value_median` to the feature name and stored in `feat_median_values_name_l`.
* Extracting Variance Values: Similarly, the variance value for the property is extracted by filtering on the same feature name and selecting the `property_value_variance` column. The first element of this list is the variance value of the property. A corresponding name for this variance value is created by appending `_value_variance` to the feature name and stored in `feat_variance_values_name_l`.
* Extracting Units: The unit for the property is extracted by filtering on the feature name and selecting the `property_unit` column. The first element of this list is the unit of the property. A corresponding name for this unit is constructed by appending `_unit` to the feature name and stored in `feat_unity_name_l`.
* Appending Extracted Data to Lists: The extracted median values, variance values, and units are appended to their respective lists (`feat_median_values_l`, `feat_variance_values_l`, `feat_unity_l`). The constructed names for these values are also appended to their respective name lists.

In [52]:
feat_median_values_l = []
feat_variance_values_l = []
feat_unity_l = []

feat_median_values_name_l = []
feat_variance_values_name_l = []
feat_unity_name_l = []

for l in range(len(ddf['Features names'][5])):
    
    feat_median_l = list(poly_list[5][poly_list[5]['property_name'] == ddf['Features names'][5][l]]['property_value_median'])[0]
    feat_median_name_l = ddf['Features names'][5][l] + '_value_median'
    
    feat_variance_l = list(poly_list[5][poly_list[5]['property_name'] == ddf['Features names'][5][l]]['property_value_variance'])[0]
    feat_variance_name_l = ddf['Features names'][5][l] + '_value_variance'
    
    feat_property_unit_l = list(poly_list[5][poly_list[5]['property_name'] == ddf['Features names'][5][l]]['property_unit'])[0]
    feat_unit_name_l = ddf['Features names'][5][l] + '_unit'    
    
    feat_median_values_l.append(feat_median_l)
    feat_variance_values_l.append(feat_variance_l)
    feat_unity_l.append(feat_property_unit_l)
    
    feat_median_values_name_l.append(feat_median_name_l)
    feat_variance_values_name_l.append(feat_variance_name_l)
    feat_unity_name_l.append(feat_unit_name_l)
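
The loop above repeats the same filter-then-take-first pattern three times per feature. As a sketch only, that pattern could be wrapped in a small helper. The function name `first_value` and the toy frame `toy_poly` are hypothetical; the two rows reuse values from the `poly(hex-1-ene)` table shown earlier.

```python
import pandas as pd

# Toy per-polymer frame; values taken from the poly(hex-1-ene) output above
toy_poly = pd.DataFrame({
    'property_name': ['Density', 'Melting temperature'],
    'property_value_median': [0.854, 30.0],
    'property_value_variance': [0.003341, 7168.266667],
    'property_unit': ['g/cm3', 'C'],
})

def first_value(frame, feature, column):
    """Return the first entry of `column` among rows whose property_name is `feature`."""
    mask = frame['property_name'] == feature
    return frame.loc[mask, column].iloc[0]

median = first_value(toy_poly, 'Density', 'property_value_median')
unit = first_value(toy_poly, 'Melting temperature', 'property_unit')
print(median, unit)
```

Like `list(...)[0]` in the loop, `.iloc[0]` raises an `IndexError` when no row matches, so a missing feature still fails loudly rather than silently returning nothing.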

Now we can apply the same procedure to every polymer in `poly_list`.

In [53]:
median_all = []
variance_all = []
unity_all = []

median_names_all = []
variance_names_all = []
unity_names_all = []

for s in range(len(poly_list)):
    feat_median_values_l = []
    feat_variance_values_l = []
    feat_unity_l = []

    feat_median_values_name_l = []
    feat_variance_values_name_l = []
    feat_unity_name_l = []

    for l in range(len(ddf['Features names'][s])):

        feat_median_l = list(poly_list[s][poly_list[s]['property_name'] == ddf['Features names'][s][l]]['property_value_median'])[0]
        feat_median_name_l = ddf['Features names'][s][l] + '_value_median'

        feat_variance_l = list(poly_list[s][poly_list[s]['property_name'] == ddf['Features names'][s][l]]['property_value_variance'])[0]
        feat_variance_name_l = ddf['Features names'][s][l] + '_value_variance'

        feat_property_unit_l = list(poly_list[s][poly_list[s]['property_name'] == ddf['Features names'][s][l]]['property_unit'])[0]
        feat_unit_name_l = ddf['Features names'][s][l] + '_unit'   

        feat_median_values_l.append(feat_median_l)
        feat_variance_values_l.append(feat_variance_l)
        feat_unity_l.append(feat_property_unit_l)

        feat_median_values_name_l.append(feat_median_name_l)
        feat_variance_values_name_l.append(feat_variance_name_l)
        feat_unity_name_l.append(feat_unit_name_l)
    
    median_all.append(feat_median_values_l)
    variance_all.append(feat_variance_values_l)
    unity_all.append(feat_unity_l)
    
    median_names_all.append(feat_median_values_name_l)
    variance_names_all.append(feat_variance_values_name_l)
    unity_names_all.append(feat_unity_name_l)


The cell above iterates over each polymer in `poly_list` and, for each one, over every feature listed in `ddf['Features names']`. For each feature, it extracts three quantities: the median value, the variance, and the unit. These values are stored in separate lists (`median_all`, `variance_all`, `unity_all`) along with their corresponding names (`median_names_all`, `variance_names_all`, `unity_names_all`).

Here's a step-by-step explanation of what this code does:

* Initialization: For each polymer in `poly_list`, the code initializes six temporary lists to hold the median values, variance values, and units for each property, as well as the names for these categories.
* Nested Loop:
    * The outer loop (`s`) iterates over each polymer in `poly_list`.
    * The inner loop (`l`) iterates over each feature name for the current polymer as specified in `ddf['Features names']`.
* Data Extraction: Within the inner loop, for each feature, the code filters the current polymer's DataFrame (`poly_list[s]`) for rows where `property_name` matches the current feature name (`ddf['Features names'][s][l]`). It extracts the median value, variance, and unit of the property from the filtered DataFrame and also constructs the names for these metrics by appending suffixes (`_value_median`, `_value_variance`, `_unit`) to the feature name.
* Appending Values: The extracted values and their constructed names are appended to the respective temporary lists.
* Aggregating Results: After processing all features of a polymer, the temporary lists are appended to the corresponding "all" lists (`median_all`, `variance_all`, `unity_all`, `median_names_all`, `variance_names_all`, `unity_names_all`), which aggregate the values and names across all polymers.

The next cell loops over the indices of `poly_list` (the list of per-polymer DataFrames). For each polymer, it zips together the names and values for medians, variances, and units, then converts each zipped list into a dictionary. The dictionaries are appended to three separate lists (`medians`, `variances`, `units`), one per property type.

By zipping the names and values together, you make the data more accessible for operations that require understanding the relationship between a feature's name and its numerical value or unit, such as generating reports, performing detailed analyses, or creating visualizations that require labeled data points.

In [54]:
medians = []
variances = []
units = []

for v in range(len(poly_list)):
    median_v = list(zip(median_names_all[v], median_all[v]))  
    variance_v = list(zip(variance_names_all[v], variance_all[v]))
    unity_v =  list(zip(unity_names_all[v], unity_all[v]))
    
    medians.append(dict(median_v))
    variances.append(dict(variance_v))
    units.append(dict(unity_v))

Here’s what each step does:

* Zipping names and values: The `zip` function is used to pair each feature's name (with a suffix indicating the property type, such as `_value_median`, `_value_variance`, or `_unit`) with its corresponding value (median, variance, or unit) for the current polymer (`v`). This results in a list of tuples for medians, variances, and units, where each tuple contains a name-value pair.
    * `median_v = list(zip(median_names_all[v], median_all[v]))`: Creates a list of tuples for median values and their names.
    * `variance_v = list(zip(variance_names_all[v], variance_all[v]))`: Creates a list of tuples for variance values and their names.
    * `unity_v = list(zip(unity_names_all[v], unity_all[v]))`: Creates a list of tuples for unit names and their values.
* Converting to dictionaries: Each list of tuples is then converted into a dictionary using `dict()`. This conversion maps each feature's name (as the key) to its corresponding value (as the dictionary value). This is done for median values, variance values, and units, creating three dictionaries for each polymer.
    * `medians.append(dict(median_v))`: Appends a dictionary of median values (and their corresponding feature names) to the `medians` list.
    * `variances.append(dict(variance_v))`: Appends a dictionary of variance values (and their corresponding feature names) to the `variances` list.
    * `units.append(dict(unity_v))`: Appends a dictionary of units (and their corresponding feature names) to the `units` list.
* Appending to lists: Finally, each dictionary is appended to its corresponding list (`medians`, `variances`, `units`), creating a collection of dictionaries for each polymer. Each list will contain as many dictionaries as there are polymers in `poly_list`, with each dictionary representing the median values, variance values, or units of the properties for a single polymer.
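
The zip-and-dict step described above can be tried on a small scale. The sketch below uses made-up inputs (`toy_names_all`, `toy_values_all`) shaped like `median_names_all` and `median_all`. It then shows one possible next step, which is an assumption about how the dictionaries might be used rather than something done in the cells above: passing the list of dictionaries to `pd.DataFrame` to obtain a wide table with one row per polymer.

```python
import pandas as pd

# Toy inputs shaped like median_names_all and median_all (two polymers)
toy_names_all = [
    ['Density_value_median', 'Refractive index_value_median'],
    ['Density_value_median'],
]
toy_values_all = [
    [0.9362, 1.531],
    [0.854],
]

# Pair names with values, producing one dictionary per polymer
toy_medians = [dict(zip(names, values))
               for names, values in zip(toy_names_all, toy_values_all)]

# A list of dictionaries converts directly into a wide table:
# one row per polymer, one column per property, NaN where a value is absent
wide = pd.DataFrame(toy_medians)
print(wide)
```

Because the dictionary keys become column names, polymers that lack a given property simply get a missing value in that column.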