# **Data Bias and Feature Importance**

## **Statistical Bias and Feature Importance**

### **Statistical Bias**

![](2023-12-29-11-45-47.png)

Statistical bias is the tendency of a statistic to either overestimate or underestimate a parameter. A dataset could be biased if it contains a disproportionately large number of reviews for, let's say, one product category called A and fewer reviews for other categories such as product categories B and C.

The first one that we see here is activity bias. These are biases that exist in human-generated content, especially on social media. Think about all the data that has been collected on social media platforms over the last several years. In reality, only a very small percentage of the population actively participates on these platforms, so the data collected there is not representative of the entire population.

The second one is similar but slightly different: societal bias. These are, once again, biases in data generated by humans, but not necessarily on social media. They can be introduced by preconceived notions that exist in society. Data generated by humans can be biased because all of us have unconscious biases.

Sometimes bias is introduced by the machine learning system itself. Say, for example, a machine learning application gives users a few options to select from, and once the user selects an option, that selection is used as training data to further train and improve the model. This introduces feedback loops. Take, for example, a streaming service. You want to watch a movie, the service makes a few recommendations, and you decide to watch Dances with Wolves. You like the movie and rate it highly. From then on, the service keeps recommending movies that have wolves in them, partly because of the feedback you provided. But maybe you watched that movie because you like the actors in it, and you don't even particularly like wolves. Situations like this can result in selection bias, which includes a feedback loop involving both the model's consumers and the machine learning model itself.

Now, even if you detect some of the statistical biases in your dataset before training your model, drift can still happen once the model is trained and deployed. Data drift, also called data shift, happens when the distribution of the data the model sees varies significantly from the distribution of the training data that was used to initially train the model. There are several variations of data drift.


![](2023-12-29-11-55-53.png)

Sometimes the distribution of the independent variables, that is, the features that make up your dataset, can change. That's called covariate drift. Sometimes the distribution of your labels, or target variables, might change. That's the second one, prior probability drift. Sometimes the relationship between the two, that is, the relationship between the features and the labels, can change as well. That's called concept drift.

Concept drift, also called concept shift, can happen when the definition of the label itself changes based on a particular feature, like age or geographical location. Take, for example, my experience. When we took a road trip across the US a few years ago, we quickly found out that soft drinks are not called the same thing everywhere. When we stopped for meals and ordered soft drinks, we realized that in some areas it's called pop, and in other areas it's called soda. Now, if you think about all the geographies across the world, you can only imagine the interesting combinations of different labels you can come up with.

With all these issues that could potentially affect your datasets, it becomes really important to continuously monitor and detect the various biases that could be present in your training datasets, before and after you train your models. In this section, we will focus on detecting such biases and imbalances in the pre-training datasets.
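As a rough illustration of the three kinds of drift described above, the sketch below compares a training sample with a later sample using the total variation distance between discrete distributions. The choice of distance, the function names, and the toy soda/pop rows are all assumptions made for this example; they are not taken from any SageMaker API.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions (0 = same, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def label_distribution_per_feature(rows):
    """Estimate P(label | feature value) from (feature_value, label) rows."""
    labels_by_feature = defaultdict(list)
    for feature_value, label in rows:
        labels_by_feature[feature_value].append(label)
    return {f: distribution(labels) for f, labels in labels_by_feature.items()}


def drift_report(train_rows, live_rows):
    """Score covariate, prior probability, and concept drift between two samples."""
    train_x, train_y = zip(*train_rows)
    live_x, live_y = zip(*live_rows)
    train_cond = label_distribution_per_feature(train_rows)
    live_cond = label_distribution_per_feature(live_rows)
    shared = set(train_cond) & set(live_cond)
    return {
        # Change in the feature distribution P(x).
        "covariate_drift": total_variation_distance(
            distribution(train_x), distribution(live_x)),
        # Change in the label distribution P(y).
        "prior_probability_drift": total_variation_distance(
            distribution(train_y), distribution(live_y)),
        # Largest change in P(y | x) over feature values seen in both samples.
        "concept_drift": max(
            (total_variation_distance(train_cond[f], live_cond[f]) for f in shared),
            default=0.0),
    }


# Toy data (made up): feature = region, label = the word used for a soft drink.
train = [("west", "soda")] * 50 + [("midwest", "pop")] * 50
live = [("west", "soda")] * 50 + [("midwest", "soda")] * 25 + [("midwest", "pop")] * 25
report = drift_report(train, live)
print(report)
```

In this toy sample the mix of regions is unchanged, so the covariate score stays at zero, while the label mix and the label used within one region both move. A real monitor would also need to account for sampling noise before raising an alert.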

![](2023-12-29-11-58-52.png)

![](2023-12-29-12-00-34.png)

The next metric that I will introduce is DPL, the Difference in Proportions of Labels. This metric measures the imbalance of positive outcomes between different facet values. Applied to the product review dataset, it measures whether a particular product category, say product category A, has disproportionately higher ratings than other categories. So while CI, the metric that we just saw, measures whether a particular category has a higher total number of reviews than the other categories, DPL looks for whether it has higher ratings than the other product categories.
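To make the difference between CI and DPL concrete, here is a minimal sketch that computes both on a handful of made-up reviews. It assumes the usual normalized definitions, CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, where facet a is the chosen product category, facet d is every other row, and q is the share of positive labels in a facet. The sign convention, the function names, and the toy rows are assumptions of this example.

```python
def class_imbalance(rows, facet_value):
    """CI: how over- or under-represented one facet value is, in [-1, 1]."""
    n_a = sum(1 for category, _ in rows if category == facet_value)
    n_d = len(rows) - n_a
    if n_a + n_d == 0:
        raise ValueError("rows must not be empty")
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(rows, facet_value, positive_label=1):
    """DPL: share of positive labels inside the facet minus the share outside it."""
    inside = [label for category, label in rows if category == facet_value]
    outside = [label for category, label in rows if category != facet_value]
    if not inside or not outside:
        raise ValueError("both facets need at least one row")
    q_a = sum(1 for label in inside if label == positive_label) / len(inside)
    q_d = sum(1 for label in outside if label == positive_label) / len(outside)
    return q_a - q_d


# Toy reviews (made up): (product_category, sentiment) with 1 = positive.
reviews = (
    [("A", 1)] * 5 + [("A", 0)]      # category A: 6 reviews, 5 positive
    + [("B", 1), ("B", 0)]           # category B: 2 reviews, 1 positive
    + [("C", 0)] * 2                 # category C: 2 reviews, 0 positive
)
ci = class_imbalance(reviews, "A")
dpl = difference_in_proportions_of_labels(reviews, "A")
print(f"CI={ci:.3f} DPL={dpl:.3f}")
```

A positive CI here says category A has more reviews than the rest combined, while a positive DPL says its reviews are more often positive. The two can disagree, which is why both are reported.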

![](2023-12-29-12-12-17.png)

![](2023-12-29-12-12-57.png)

![](2023-12-29-12-13-26.png)

![](2023-12-29-12-14-11.png)

![](2023-12-29-12-15-10.png)

![](2023-12-29-12-16-06.png)

![](2023-12-29-12-17-10.png)

By using two parameters, instance type and instance count, you can scale up the distributed cluster to the capacity that you need. 

![](2023-12-29-12-22-29.png)

![](2023-12-29-12-25-01.png)

The next step is to configure the data config object from the Clarify library. The data config object captures the details about your data. As you would expect, it holds the input and output locations of your data in S3, as well as the label that you're trying to predict using that dataset. In this case, the label that we are trying to predict is sentiment.

Next, you configure the bias config object from the Clarify library. The bias config object captures the facet, that is, the feature name, that you want to evaluate for bias or imbalances. In this case, you're trying to find imbalances in the product category feature. The parameter `label_values_or_threshold` defines the desired values for the label. So if the sentiment feature is your label, the desired value for that label goes into this parameter.

Once you have configured those three objects, you are ready to run the pre-training bias method on the Clarify processor. In addition to the data config and the bias config that you already configured, you can specify the methods that you want to evaluate for bias. These methods are the metrics that you've already learned about for detecting bias, here CI (class imbalance) and DPL, and you can specify a few other methods as well. The `wait` parameter specifies whether this bias detection job should block the rest of your code or run in the background. Similarly, the `logs` parameter specifies whether you want to capture the logs.

Once the pre-training bias method is configured, you launch the job. In the background, SageMaker Clarify uses a construct called a SageMaker Processing Job to execute the bias detection at scale. SageMaker Processing Jobs let you perform any data-related task at scale, such as pre-processing, post-processing, or using data to evaluate your model. As you can see in the figure here, the SageMaker Processing Job expects the data to be in an S3 bucket. The data is collected from the S3 bucket and processed on a processing cluster, which can run a variety of containers. By default, containers for scikit-learn, Python, and a few others are supported, and you also have the opportunity to bring your own custom container. Once the processing cluster has processed the data, the transformed data is put back in the S3 bucket.
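The configuration steps described above might look roughly like the following with the `sagemaker.clarify` module of the SageMaker Python SDK. Treat it as a sketch under assumptions: the S3 paths, the role ARN, the column names other than `sentiment`, the instance type, and the positive label value `1` are placeholders invented for illustration. The settings are built as plain dictionaries so they can be inspected without AWS access, and the SDK is imported only when the job is actually launched.

```python
def build_bias_job_settings(input_uri, output_uri, headers):
    """Collect keyword arguments for the processor, data config, bias config, and run call."""
    return {
        # Instance type and count size the processing cluster.
        "processor": {"instance_count": 1, "instance_type": "ml.c5.2xlarge"},
        "data_config": {
            "s3_data_input_path": input_uri,
            "s3_output_path": output_uri,
            "label": "sentiment",            # column the model will predict
            "headers": headers,
            "dataset_type": "text/csv",
        },
        "bias_config": {
            "label_values_or_threshold": [1],  # assumed encoding of the desired label
            "facet_name": "product_category",  # assumed column name for the facet
        },
        # wait=False returns immediately; logs=False skips streaming job logs.
        "run": {"methods": ["CI", "DPL"], "wait": False, "logs": False},
    }


def launch_pre_training_bias_job(settings, role_arn):
    """Start the job. Requires the SageMaker Python SDK and valid AWS credentials."""
    from sagemaker import clarify  # lazy import so the settings can be built offline

    processor = clarify.SageMakerClarifyProcessor(role=role_arn, **settings["processor"])
    data_config = clarify.DataConfig(**settings["data_config"])
    bias_config = clarify.BiasConfig(**settings["bias_config"])
    processor.run_pre_training_bias(
        data_config=data_config,
        data_bias_config=bias_config,
        **settings["run"],
    )
    return processor


# Placeholder locations and columns; replace them with your own before launching.
settings = build_bias_job_settings(
    input_uri="s3://example-bucket/reviews/train.csv",
    output_uri="s3://example-bucket/reviews/bias-report",
    headers=["sentiment", "review_body", "product_category"],
)
```

Calling `launch_pre_training_bias_job(settings, role_arn)` with a real execution role would submit the processing job, and the bias report would be written under the configured output path.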

![](2023-12-29-12-32-01.png)

![](2023-12-29-12-32-29.png)

![](2023-12-29-12-33-15.png)

![](2023-12-29-12-36-47.png)

The first option, Data Wrangler, provides you with more of a UI-based visual experience. If you would like to connect to multiple data sources, explore your data in a more visual format, configure what goes into your bias reports by making selections from drop-down boxes and option buttons, and finally launch the bias detection job with a button click, Data Wrangler is the tool for you. Keep in mind that Data Wrangler uses only a subset of your data to detect bias in that dataset.

On the other hand, SageMaker Clarify provides you with more of an API-based approach. Clarify also gives you the ability to scale out the bias detection process. It uses a construct called processing jobs that lets you configure a distributed cluster to execute your bias detection job at scale. So, if you're thinking of large volumes of data, for example, millions and millions of rows of product reviews, and you want to explore that dataset for bias, then SageMaker Clarify is the tool for you, because you can take advantage of the scale and capacity offered by the cloud.

### **Feature Importance**

![](2023-12-29-12-39-08.png)

![](2023-12-29-12-40-43.png)

Using the SHAP framework, you can provide both local and global explanations. A local explanation indicates how each individual feature contributes to a single prediction, while a global explanation takes a much more comprehensive view and tries to understand how the data in its entirety contributes to the final outcome of the machine learning model. The SHAP framework is also very extensive in nature, in that it considers all possible combinations of feature values along with all possible outcomes for your machine learning model. Because of this extensive nature, SHAP can be very time intensive, but for the same reason it can provide you with guarantees in terms of consistency and local accuracy.

In this video, I will demonstrate how to use Data Wrangler to calculate feature importance on your dataset. To get started, you start from the Amazon SageMaker Studio homepage, go to the New data flow section of the launch page, and start a new data flow.
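To make the SHAP discussion above more tangible, the following sketch computes exact Shapley values by enumerating every coalition of features, which also shows why the approach gets expensive as the number of features grows. It is not the `shap` library and not what Data Wrangler runs internally. Replacing absent features with a fixed baseline value, the toy linear model, and all function names are assumptions of this example.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction (a local explanation).

    Features left out of a coalition take their baseline value, which is
    one common simplification and an assumption of this sketch.
    """
    n = len(instance)

    def coalition_value(present):
        mixed = [instance[i] if i in present else baseline[i] for i in range(n)]
        return predict(mixed)

    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                gain = coalition_value(present | {i}) - coalition_value(present)
                contribution += weight * gain
        values.append(contribution)
    return values


def global_importance(predict, instances, baseline):
    """Global view: mean absolute Shapley value per feature over many predictions."""
    per_instance = [shapley_values(predict, x, baseline) for x in instances]
    return [
        sum(abs(row[i]) for row in per_instance) / len(per_instance)
        for i in range(len(baseline))
    ]


def toy_model(x):
    # Hypothetical linear scorer, used only to keep the example checkable.
    return 2.0 * x[0] + 3.0 * x[1] - 1.0 * x[2]


baseline = [0.0, 0.0, 0.0]
local = shapley_values(toy_model, [1.0, 2.0, 3.0], baseline)
overall = global_importance(toy_model, [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0]], baseline)
print(local, overall)
```

The local values sum to the difference between the prediction for the instance and the prediction for the baseline, which is the local accuracy property mentioned above. Each feature needs 2^(n-1) coalitions, so practical tools rely on sampling or model-specific shortcuts instead of full enumeration.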

![](2023-12-29-12-41-56.png)

![](2023-12-29-12-42-22.png)

![](2023-12-29-12-43-35.png)

![](2023-12-29-12-43-56.png)

![](2023-12-29-12-44-19.png)

![](2023-12-29-12-44-42.png)

![](2023-12-29-12-44-54.png)

![](2023-12-29-12-45-16.png)

![](2023-12-29-12-45-25.png)

When I select the right CSV, you get a little preview of the features, or the columns, that are included in that CSV file. I'll quickly take a look at these columns. I can see that the reviewer age is present, along with the review title, the review text, and the star rating, which is the rating that the reviewer has assigned to this particular product. I have a column called recommended indicator, which is a Boolean value that says whether the reviewer is recommending this product to others or not. I also have the positive feedback count, which is an indication of how many positive reviews a particular product has received. And I have a few other columns here that indicate the division name, the department name, and the class name.

![](2023-12-29-12-47-21.png)

![](2023-12-29-12-47-37.png)

![](2023-12-29-12-47-59.png)

![](2023-12-29-12-48-16.png)

![](2023-12-29-12-48-32.png)

![](2023-12-29-12-48-56.png)

![](2023-12-29-12-49-11.png)

![](2023-12-29-12-49-38.png)

![](2023-12-29-12-49-56.png)

![](2023-12-29-12-50-16.png)

![](2023-12-29-12-50-42.png)

![](2023-12-29-12-50-58.png)

![](2023-12-29-12-51-21.png)

![](2023-12-29-12-51-57.png)

![](2023-12-29-12-52-37.png)

![](2023-12-29-12-53-26.png)

![](2023-12-29-12-53-41.png)

I would expect the positive feedback count to contribute to your final model when it's trying to predict the star rating of a product, but this is just a starting point. After you see the feature importance, you have an opportunity to go back, fix your dataset, and do more feature engineering. For example, you can drop columns that are not contributing at all, which reduces the dimensions of your training dataset and lets training run much faster. You can also combine fields: combining the positive feedback count and the recommended indicator into a new feature might improve the F1 score of the final model. So, based on what the feature importance scores show you, you can go back and do more feature engineering on your dataset.

Now that I have had a chance to look through the F1 scores and do a little bit of analysis, I will go ahead and create the analysis here. The reason for creating it is that the next time I want to look at the data and the F1 scores, I don't have to start from scratch. I can simply use this persisted analysis and get back to this screen.
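The two feature engineering ideas above, dropping weak columns and combining fields, could be sketched in plain Python as follows. Everything specific here is hypothetical: the column names, the importance scores, the threshold, and the decision to combine the two fields by multiplication are invented for illustration and are not values from the Data Wrangler analysis.

```python
def drop_low_importance(rows, importance, threshold):
    """Remove columns whose importance score falls below the threshold.

    Columns without a score (for example, newly engineered ones) are kept.
    """
    to_drop = {name for name, score in importance.items() if score < threshold}
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


def add_combined_feature(rows, first, second, new_name):
    """Add a column holding the product of two existing numeric columns."""
    return [{**row, new_name: row[first] * row[second]} for row in rows]


# Hypothetical rows and scores, for illustration only.
rows = [
    {"positive_feedback_count": 4, "recommended_ind": 1, "division_name": "General"},
    {"positive_feedback_count": 7, "recommended_ind": 0, "division_name": "Petite"},
]
importance = {
    "positive_feedback_count": 0.40,
    "recommended_ind": 0.35,
    "division_name": 0.01,
}

combined = add_combined_feature(
    rows, "positive_feedback_count", "recommended_ind", "recommended_feedback")
reduced = drop_low_importance(combined, importance, threshold=0.05)
print(reduced)
```

Whether a change like this actually helps is an empirical question, so the feature importance analysis would need to be rerun on the modified dataset to compare scores.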

![](2023-12-29-12-55-38.png)

![](2023-12-29-12-55-57.png)