# Data 301 Final Project: Team Fortress 2 Unusual Data Analysis

## Introduction

**Team Fortress 2** is a first-person, class-based shooter game developed by Valve Software. The game was originally released in 2007 and is best known for its cartoon graphics, its nine playable classes, and its in-game item economy. In this project, I am going to take a closer look at data involving the game's rarest and most expensive items, **unusuals**.

![](./PresentationPictures/tf2.png) 

While the vast majority of Team Fortress 2 players play the game for its 

This passion led me to dedicate hours upon hours of my high school life to bargaining with other TF2 traders in an attempt to get the best deal I could. In the span of a couple of years, I 

![](./PresentationPictures/channels4_banner.jpg) 

**Unusuals** in Team Fortress 2 work in the following way: 

- A couple of times a year, Valve adds a new crate. These crates contain cosmetic items you can wear on your characters, such as new hats or jackets.
- Crates work like traditional loot boxes or booster packs. Each crate contains one randomized item, and in order to "unbox" it, a player must buy a "key" from the in-game store for approximately $2.50. Each key works on only one crate and is then consumed. (This is how Valve monetizes the game.)
- Most items in crates are worth less than fifty cents. However, you have an approximately 1% chance to unbox an **unusual** item, an item with a rare particle effect surrounding it. These items can be worth anywhere from 10 dollars to 6,000 dollars.

**On the left is a regular hat. On the right is an unusual version of the hat.**

Each unusual is uniquely identified by the combination of a **cosmetic** and a **particle effect**. There are a plethora of different unusuals out there, and depending on how good the item looks, some unusuals may sell for massive amounts of money. Once an unusual sells, it is priced at that amount on the community website **backpack.tf**, using the in-game currency of keys (the same keys you use to unbox crates). Since each key costs about $2.50, an unusual priced at 10 keys would be worth roughly 25 dollars. An unusual priced at 100 keys would be worth 250 dollars, and so on.
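
The key-to-dollar conversion is simple arithmetic. Here is a minimal sketch, assuming the roughly $2.50 store price per key quoted above (the function name `keys_to_usd` is just illustrative):

```python
KEY_PRICE_USD = 2.50  # approximate in-game store price of one key


def keys_to_usd(price_in_keys: float, key_price_usd: float = KEY_PRICE_USD) -> float:
    """Convert a price in keys to an approximate dollar value."""
    return price_in_keys * key_price_usd


print(keys_to_usd(10))   # 25.0
print(keys_to_usd(100))  # 250.0
```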

## The Goal of the Project

As a YouTuber, many people looked up to me in the trading community, and I would often be asked for trading advice. Some of the most common questions I got asked were the following:

- What is the **best crate** to unbox to maximize the chance of unboxing an expensive unusual?
- I just unboxed an unusual, but it doesn't have a community price yet. **How much is it worth?**

While I've always been able to give people rough answers to these questions, I've never been able to systematically determine the correct answer. Until now. Using as much data as I can find about every unusual that exists or could possibly exist, I am going to try and answer both of these questions in this project. 

Before I start wrangling the data, I think it is important to point out that while these two questions might seem very different from each other, it turns out they are inextricably linked. You'll see why once I start visualizing the data, but for now just keep this in the back of your head.

Given the setup so far, my research objectives are as follows:

- Analyze the data to find the crate with the highest expected unusual value (EV).
- Build a machine learning model to predict the price of unpriced unusuals.
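
One rough way to frame a crate's EV is the chance of unboxing an unusual multiplied by the average price of the unusuals that crate can produce. The sketch below assumes the roughly 1% unusual chance mentioned earlier and, as a simplification, that every unusual a crate can drop is equally likely. The crate names and prices are made up purely for illustration.

```python
from statistics import mean

UNUSUAL_CHANCE = 0.01  # approximate chance of an unusual per unbox


def crate_ev_in_keys(unusual_prices_in_keys: list[float]) -> float:
    """Expected unusual value per unbox, in keys.

    Assumes each unusual the crate can drop is equally likely, which is a
    simplification and not necessarily how the game rolls items.
    """
    return UNUSUAL_CHANCE * mean(unusual_prices_in_keys)


# Hypothetical crates with made-up prices (in keys) for their possible unusuals.
crates = {
    "Crate A": [10, 20, 30, 100],
    "Crate B": [8, 12, 15, 25],
}

ev_by_crate = {name: crate_ev_in_keys(prices) for name, prices in crates.items()}
best_crate = max(ev_by_crate, key=ev_by_crate.get)
print(best_crate, ev_by_crate[best_crate])
```

Note that an unpriced unusual has no price to average, which is one reason the two objectives end up depending on each other.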

## Collecting & Cleaning the Data

Since I couldn't find any CSV files containing the data I needed, I had to compile everything myself. This involved scraping many pages from the **Team Fortress 2 Wiki**, using a **JSON API** to download unusual prices (from the website Backpack.tf), and creating a **Google survey** to get additional data from experienced traders.


**In total, the following data was pulled, compiled, and cleaned.**

- List of all possible unusual cosmetics scraped from the TF2 Wiki [found here](https://wiki.teamfortress.com/wiki/Template:Unusual_quality_table).
- List of all possible unusual effects scraped from the TF2 Wiki [found here](https://wiki.teamfortress.com/wiki/Unusual).
- List of unusual effect IDs scraped from Backpack.tf [found here](https://backpack.tf/developer/particles). *Note:* These ID numbers are only needed to fetch pricing data from Backpack.tf and are not used in any machine learning models.  
- List of which unusual cosmetics can be "unboxed" out of each crate, scraped from a community-created guide, [found here](https://steamcommunity.com/sharedfiles/filedetails/?id=731640447). 
- List of additional descriptive data from each unusual, compiled by scraping each cosmetic's wiki page, [found here](https://wiki.teamfortress.com/wiki/).
- List of which cosmetics are "robo" hats (robotic versions of normal hats), scraped from the TF2 Wiki [found here](https://wiki.teamfortress.com/wiki/Robotic_Boogaloo).
- List of effect ratings (on a scale of 0 to 5), compiled via a Google Form sent out to experienced traders. The form can be [found here](https://docs.google.com/forms/d/e/1FAIpQLSd2kf5WsAKLIQfwrujgNKBFBhWeS0_ukJkfob3hzJ4-kw7XAA/viewform).
- Prices for each unusual from Backpack.tf's JSON API, [found here](https://backpack.tf/developer).

Once I had downloaded all of the data, I merged everything into a single master dataset. Since this dataset included every possible combination of cosmetic and effect (even unusuals that don't exist yet), it was massive, with over 41,000 observations!
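
To give a feel for why the dataset gets so large, here is a small sketch of building every cosmetic and effect combination and attaching a price where one is known. The lists and prices are tiny made-up stand-ins for the scraped data, not the real values.

```python
from itertools import product

# Tiny made-up stand-ins for the scraped lists.
cosmetics = ["Hat A", "Hat B", "Hat C"]
effects = ["Effect X", "Effect Y"]

# Hypothetical known prices in keys, keyed by (cosmetic, effect).
known_prices = {("Hat A", "Effect X"): 40.0, ("Hat C", "Effect Y"): 12.5}

# One row per possible combination; Price is None when the unusual is unpriced.
master = [
    {"Name": c, "Effect": e, "Price": known_prices.get((c, e))}
    for c, e in product(cosmetics, effects)
]

unpriced = [row for row in master if row["Price"] is None]
print(len(master), len(unpriced))  # 6 4
```

The row count is the number of cosmetics times the number of effects, so it grows quickly as either list gets longer.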

![](./PresentationPictures/DatasetAtThisPoint.PNG) 

One thing I noticed about my dataset was that it was almost entirely categorical. While I knew `Price` and `Effect Community Rating` were going to be crucial for machine learning, I realized quickly that `Months Since Last Price Update` had no correlation with anything else in the dataset, leaving me with only 2 useful quantitative variables and almost 10 categorical variables.

Unfortunately, this large number of categorical variables turned out to significantly hamper the performance of any K-Nearest Neighbors (KNN) models I ran on the data. Some rare cosmetics in the dataset have only a few priced unusuals, which left KNN unable to find enough reasonable neighbors. For example, if I used a larger value of k such as 20 (which was important for predicting low-value unusuals) and tried to predict the price of a cosmetic with only 5 priced unusuals, the model would find those 5 neighbors and then grab 15 others that were quite a bit "further away", simply because it couldn't compare variables such as `Name` or `Effect` very well and had no way of finding the "next best" neighbors.
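
To see why categorical columns give KNN so little to work with, consider a simple mismatch-count distance over `Name` and `Effect`, which is roughly how one-hot encoded categories behave. This is only an illustration with made-up rows, not the exact distance used by the models in this project.

```python
def mismatch_distance(a: dict, b: dict, columns=("Name", "Effect")) -> int:
    """Count how many categorical columns differ between two rows."""
    return sum(a[col] != b[col] for col in columns)


# Made-up priced unusuals; "Hat A" is a rare cosmetic with only two of them.
priced = [
    {"Name": "Hat A", "Effect": "Effect X"},
    {"Name": "Hat A", "Effect": "Effect Y"},
    {"Name": "Hat B", "Effect": "Effect X"},
    {"Name": "Hat C", "Effect": "Effect Y"},
    {"Name": "Hat D", "Effect": "Effect W"},
]

query = {"Name": "Hat A", "Effect": "Effect Z"}  # the unusual we want to price
distances = [mismatch_distance(query, row) for row in priced]
print(distances)  # [1, 1, 2, 2, 2]
```

Every row that isn't the same cosmetic is tied at the maximum distance, so once the matching rows run out there is nothing to say which of the remaining rows is the next-best neighbor.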

To solve this issue, I created two new quantitative metrics, `Median Hat Rating` and `Median Effect Rating`, using existing variables.

To define `Median Hat Rating`, I found the median price of each unusual **effect** and merged this data into the dataset as an additional column. Next, I took the difference between the price of each unusual and its effect's median price, yielding a positive or negative value depending on whether the unusual was priced above or below the median for that effect. I then grouped these differences by **cosmetic (hat)** and took the median of each group. This yielded one value per cosmetic: a positive value indicated that the cosmetic's unusuals were typically priced above the median for their effects (thus better!), and a negative value indicated that they were typically priced below it (thus worse). This gave me a tier order distinguishing the best cosmetics from the worst ones.

To define `Median Effect Rating`, I used the same approach as for `Median Hat Rating`, except I swapped the roles of cosmetics and effects: each price was compared against its cosmetic's median price, and the differences were grouped by effect. This gave me a tier order for effects (similar to `Effect Community Rating`, as it turned out) that could distinguish the best effects from the worst.
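
The two ratings can be sketched with one small helper. This is a plain-Python illustration on made-up prices; the helper name `median_rating` and its parameters are hypothetical, and the real dataset would go through the same three steps.

```python
from collections import defaultdict
from statistics import median


def median_rating(rows, rate_by, baseline_by):
    """Median, per `rate_by` group, of price minus the `baseline_by` group's median price."""
    # Step 1: median price for every baseline group (e.g. every effect).
    prices = defaultdict(list)
    for row in rows:
        prices[row[baseline_by]].append(row["Price"])
    baseline = {key: median(vals) for key, vals in prices.items()}

    # Step 2: difference from the baseline, collected per rated group (e.g. every hat).
    diffs = defaultdict(list)
    for row in rows:
        diffs[row[rate_by]].append(row["Price"] - baseline[row[baseline_by]])

    # Step 3: the rating is the median of those differences.
    return {key: median(vals) for key, vals in diffs.items()}


# Made-up priced unusuals (prices in keys).
rows = [
    {"Name": "Hat A", "Effect": "Effect X", "Price": 100},
    {"Name": "Hat A", "Effect": "Effect Y", "Price": 30},
    {"Name": "Hat B", "Effect": "Effect X", "Price": 60},
    {"Name": "Hat B", "Effect": "Effect Y", "Price": 10},
]

median_hat_rating = median_rating(rows, rate_by="Name", baseline_by="Effect")
median_effect_rating = median_rating(rows, rate_by="Effect", baseline_by="Name")
print(median_hat_rating)     # {'Hat A': 15.0, 'Hat B': -15.0}
print(median_effect_rating)  # {'Effect X': 30.0, 'Effect Y': -30.0}
```

In this toy data, "Hat A" sells above the median for both of its effects, so it gets a positive rating, while "Hat B" gets a negative one.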


![](./PresentationPictures/medianHatRating.PNG)
![](./PresentationPictures/medianEffectRating.PNG)

While the numbers don't look very pretty here, it doesn't really matter since I plan to standardize the data before fitting it to a machine learning model.
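
As a rough idea of what standardizing before KNN looks like, here is a plain-Python sketch using z-scores and a simple nearest-neighbor average. The numbers are made up and the helper names are hypothetical; this is not the project's actual pipeline.

```python
from math import dist
from statistics import mean, pstdev


def fit_scaler(column):
    """Return a function that rescales values using this column's mean and std."""
    mu, sigma = mean(column), pstdev(column)
    return lambda x: (x - mu) / sigma


def knn_predict(train_X, train_y, query, k):
    """Average the target of the k training rows closest to `query` (Euclidean)."""
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))[:k]
    return mean(train_y[i] for i in nearest)


# Made-up training data: two quantitative features and a price in keys.
hat_rating = [-20.0, -5.0, 0.0, 10.0, 40.0]
community_rating = [1.0, 2.0, 3.0, 4.0, 5.0]
price = [10, 15, 20, 40, 120]

# Fit one scaler per feature so both end up on a comparable scale.
scale_hat = fit_scaler(hat_rating)
scale_rating = fit_scaler(community_rating)
train_X = [(scale_hat(h), scale_rating(r)) for h, r in zip(hat_rating, community_rating)]

# The unpriced unusual must be scaled with the same fitted scalers.
query = (scale_hat(8.0), scale_rating(4.2))
prediction = knn_predict(train_X, price, query, k=2)
print(prediction)  # 30
```

Standardizing matters because the two ratings live on very different scales, and without it the feature with the larger raw range would dominate the distance.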

## Visualizing the Data

Because I had a lot of categorical variables, I found Altair plots to be a significant asset in visualizing my data, as I could display each variable via the size of points, color of points, or in different facets. 

![](./PresentationPictures/altairplot1.PNG)

In TF2 there are two main types of unusuals, unusual **hats (cosmetics)** and unusual **taunts**. This first plot displays the differences between the two types. It is immediately noticeable that:

1. There are far more hats than taunts.
2. Taunts are very inexpensive. No hat is worth less than around 8 keys, yet there are a significant number of taunts worth as little as 4 keys.

Since the points are colored based on the class (the playable character) the unusuals are for, you can also notice that unusuals that can be worn on more than one class (denoted in **orange** and **yellow**) are worth, on average, much more than unusuals that can only be worn on one class (denoted in **blue**).

Both this graph and the graph below also show that there is a roughly linear correlation between the price of an unusual and how highly the community rated its effect. This makes quite a bit of sense, since the market is driven by supply and demand. If more people like the look of an unusual, it is probably going to sell for more!

![](./PresentationPictures/altairplot2.PNG)

Just as there are far more unusual **hats** than **taunts**, there are also more types of **hats** than **taunts**. The plot above makes three distinctions in terms of types of hats. On the left are **Robotic** (**"Robo"** for short) hats. **Robo** hats are robotic versions of other hats already in the game, added during a game update themed around robots. In general the community thinks that these hats are ugly compared to their normal counterparts, and this disdain is visible in the lower prices of **robo** hats in the graph above.

On the other hand, unusual **miscs** (shown on the right) operate in an almost opposite fashion. Unusual **miscs** are unusuals that, due to their shape and size, can be worn at the same time as another unusual hat. Because players can combine different unusual effects to create flashy particle displays, unusual **miscs** are in very high demand, something that is visible in the higher prices of **miscs** in the graph above.

In 2015, Valve added an additional rarity system to crate-based items. Instead of every item having the same underlying rarity, there are now four distinct rarities (called **grades**): **Mercenary**, **Commando**, **Assassin** and **Elite**. When unboxing a crate, you have about an **80%** chance of getting a **Mercenary** item, a **15%** chance of getting a **Commando** item, a **3%** chance of getting an **Assassin** item, and a **0.6%** chance of getting an **Elite** item. When you remember that you have a **1%** chance of unboxing an unusual in the first place, it follows that the chance to pull an **unusual Elite grade** item is 0.01 × 0.006, or just **0.006%**! As such, there are barely any **Elite** items in existence (and consequently in the dataset). The plot above reflects these rarities in the price of each item, as **Mercenary** grade items tend to be fairly cheap compared to **Assassin** and **Elite** grade items.
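
The same arithmetic can be applied to every grade. A quick sketch, using the approximate percentages quoted above and assuming (as the calculation above does) that the grade roll and the unusual roll are independent:

```python
UNUSUAL_CHANCE = 0.01  # approximate chance that an unboxed item is unusual

# Approximate chance of each grade per unbox, as quoted in the text.
grade_chance = {
    "Mercenary": 0.80,
    "Commando": 0.15,
    "Assassin": 0.03,
    "Elite": 0.006,
}

# Chance per unbox of getting an unusual of each grade.
unusual_grade_chance = {
    grade: UNUSUAL_CHANCE * chance for grade, chance in grade_chance.items()
}

for grade, chance in unusual_grade_chance.items():
    print(f"{grade}: {chance:.4%}")
```

Put another way, at about 0.006% per unbox you would expect roughly one unusual Elite item for every 16,000 to 17,000 crates opened.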

Interestingly enough, you can also tell from the graph that the Robotic Boogaloo (**robo**) update occurred *before* 2015, since there are no graded robo items. You can also tell that Valve has added few miscs to the game since 2015, as there are only a few **Commando** grade miscs and no miscs of any other grade.

![](./PresentationPictures/bestcratesfirst.PNG)