## D214 Capstone: Time Series Forecasting in Processor Core Count
##### Submitted By Edwin Perry
### Table of Contents
<ol>
    <li><a href="#A">Research Question</a></li>
    <li><a href="#B">Data Collection</a></li>
    <li><a href="#C">Data Extraction and Preparation</a></li>
    <li><a href="#D">Analysis</a></li>
    <li><a href="#E">Data Summary and Implications</a></li>
    <li><a href="#F">Sources</a></li>
</ol>
<h4 id="A">Research Question</h4>
<p>The research question I decided to investigate for this project is "Can a time-series analysis be used to accurately predict the increase in processor cores based solely on publicly available data?" <br />
Processing power is increasingly relevant as technology becomes more complex and requires more resources to operate. The number of cores is one key metric in determining a processor's capacity, as more cores allow for a greater degree of parallel processing and multithreading. Furthermore, the materials used to produce these processors are incredibly difficult to acquire, so forecasting the demand for these materials by understanding how core counts increase would be valuable to any entity that supplies processor materials. <br />
Intel and AMD, two of the largest processor manufacturers, publish certain information about their processors, such as release date and number of cores. With this information, we can attempt to build a predictive model that charts the projected growth of core counts in computer processors. This type of analysis would be useful to a number of businesses. Any business using high-end, complex technologies such as generative AI can benefit from the ability to forecast the increase in computational power, allowing it to scale its technology as more powerful resources become available. Businesses providing the raw materials can use the forecasted increase in cores to project how quickly resources will need to be harvested over time. Even the processor manufacturers could use this information to understand how powerful new processors would be expected to be, helping them establish benchmark growth standards for the processors they produce. <br />
This analysis will utilize a null and alternate hypothesis to analyze the results. The hypotheses are as follows:
<ul>
<li>Null Hypothesis: The ARIMA model will not predict processor core count with a mean absolute percentage error (MAPE) below 15%.</li>
<li>Alternate Hypothesis: The ARIMA model will predict processor core count with a mean absolute percentage error (MAPE) below 15%.</li>
</ul>
</p>
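
The accuracy criterion in the hypotheses above can be sketched in code. This is a minimal illustration, assuming the usual definition of mean absolute percentage error (the average of |actual - forecast| / |actual|, expressed as a percentage). The function names and the `MAPE_THRESHOLD` constant are hypothetical helpers introduced for this example, not part of the analysis itself.

```python
from typing import Sequence

# Threshold taken from the stated hypotheses (percent).
MAPE_THRESHOLD = 15.0


def mean_absolute_percentage_error(
    actual: Sequence[float], forecast: Sequence[float]
) -> float:
    """Return the mean absolute percentage error, in percent."""
    if len(actual) == 0 or len(actual) != len(forecast):
        raise ValueError("actual and forecast must be non-empty and equal in length")
    if any(a == 0 for a in actual):
        # The percentage error is undefined when the actual value is zero.
        raise ValueError("actual values must be non-zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


def meets_accuracy_threshold(
    actual: Sequence[float],
    forecast: Sequence[float],
    threshold: float = MAPE_THRESHOLD,
) -> bool:
    """True when the error falls strictly below the threshold."""
    return mean_absolute_percentage_error(actual, forecast) < threshold
```

Under this sketch, the null hypothesis would be rejected only if `meets_accuracy_threshold` returns `True` for the held-out core counts and the model's forecasts.
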
<h4 id="B">Data Collection</h4>
<p>The data for this project was sourced from a <a href='https://www.kaggle.com/datasets/alanjo/amd-processor-specifications/data?select=INTELpartialspecs_adjusted.csv'>publicly available Kaggle dataset</a>. The dataset compiles data from both AMD and Intel into tabular structures for the purpose of analytics. The tables cover the period 1999-2022 and are available as 2 distinct CSV files. There are a total of 3,342 rows, each representing a different processor. The dataset documents the following values:
<ul>
<li>product: The name of the processor</li>
<li>releaseDate: The year that the processor was launched</li>
<li># cores: The number of cores the processor has</li>
<li># threads: The number of threads the processor can handle</li>
<li># maxTurboClock: The maximum clock speed of the processor, in gigahertz</li>
<li># baseClock: The base clock speed of the processor, in gigahertz</li>
<li># cache: The memory cache size of the processor, in megabytes</li>
<li>cacheInfo: The type of cache that the processor uses</li>
<li># TDP: The thermal design power, in Watts</li>
<li># lith: The lithography of the processor, in nanometers</li>
<li>status: The current status of processor manufacturing</li>
<li>IntegratedG: Model of integrated graphics card, if any</li>
</ul>
The data will be downloaded as 2 CSV files, then ingested using the read_csv function, which loads each file into a dataframe. The two dataframes will then be concatenated to create one comprehensive dataframe containing all of the relevant data.<br />
The primary disadvantage of performing the analysis with this dataset is the limited number of distinct values in the releaseDate column. If the data covered more years, or if the releaseDate column recorded the month as well as the year, there would be more data points available for the analysis. Instead, we will have 23 different years to look at, with no ability to break the data into smaller pieces by quarter, month, or day. As such, we may not be able to sufficiently account for the variability of the data. <br />
The primary advantage of this data collection approach is its simplicity, which allows even non-technical readers to follow the process. It consists of downloading the CSV files, loading them into dataframes, and combining the data. The process is easily replicated: anyone can download the data and perform the same steps, which enables others to validate the results of the data collection. Alternatives involving APIs or external databases can make the data collection harder to understand, but that is not the case with this analysis.</p>
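
The collection steps described above (read each CSV file, then concatenate the results) can be sketched as follows. This is a minimal illustration assuming pandas is installed. The function name `load_processor_specs` is a hypothetical helper, and the file paths passed to it would be wherever the two downloaded files were saved.

```python
from pathlib import Path
from typing import Iterable, Union

import pandas as pd


def load_processor_specs(csv_paths: Iterable[Union[str, Path]]) -> pd.DataFrame:
    """Read each CSV file into a dataframe and stack them into one table."""
    frames = [pd.read_csv(path) for path in csv_paths]
    # ignore_index=True renumbers the rows 0..n-1 so that row labels
    # from the individual files are not repeated in the combined table.
    return pd.concat(frames, ignore_index=True)
```

Calling the function with the paths of the two downloaded files would return a single dataframe. If the download is complete, its row count should match the 3,342 rows stated above, which is a quick way to validate the collection step.
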
<h4 id="C">Data Extraction and Preparation</h4>
