<h1>Code Snippets for Parsing the Tags Datasets of Technology Related Stack Exchange Sites (D1 and D2)</h1>

<p><b>Step1:</b> <u>Identifying the technology-related Stack Exchange sites from the <a href='https://stackexchange.com/sites#technology-oldest'>official site</a> of Stack Exchange.</u> At the time of writing, the Stack Exchange network hosts 182 sites, of which 77 are listed under the Technology tab, including Stack Overflow and Software Recommendations. We excluded the Russian, Japanese, Spanish and Portuguese variants of Stack Overflow, as well as Meta Stack Exchange, which deals with meta-discussions related to Stack Exchange based Q&A websites, and considered the remaining 71 sites as technology-related sites.</p>

<p><b>Step2:</b> <u>Downloading the complete set of tags used on each site from the <a href='https://data.stackexchange.com/stackoverflow/query/new'>Stack Exchange Data Explorer (SEDE)</a>.</u> The SEDE interface lets users retrieve data from any of its sites using SQL queries. We selected the technology-related sites one by one and ran a simple SQL query against each to retrieve its set of tags. SEDE allows query results to be downloaded in CSV format, so we saved one CSV file per site. The Stack Overflow tags were not downloaded here, because we took them from the complete data dump processed as part of D3.</p>
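<p>As a quick sanity check on each downloaded file, the sketch below compares a CSV header with the seven column names used by the table definition in Step 3. The function name <code>missing_tag_columns</code> and the sample row are illustrative assumptions, not part of the original workflow.</p>

```python
import csv
import io

# Column names taken from the CREATE TABLE statement in Step 3.
EXPECTED_COLUMNS = ["Id", "TagName", "Count", "ExcerptPostId",
                    "WikiPostId", "IsRequired", "IsModeratorOnly"]


def missing_tag_columns(csv_file):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Hypothetical sample standing in for a file downloaded from SEDE.
sample = io.StringIO(
    "Id,TagName,Count,ExcerptPostId,WikiPostId,IsRequired,IsModeratorOnly\n"
    "1,python,120,,,False,False\n"
)
print(missing_tag_columns(sample))  # an empty list means the header is complete
```

<p>For a real file, open it with <code>open(path, newline='')</code> and pass the file handle to the function.</p>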

<p><b>Step3:</b> <u>Saving the datasets in database tables for further processing.</u> Instead of manually importing the CSV file for each of the 70 technology-related sites, we used the following code snippet to create a table for each site and then insert the data from its CSV file into that table. <i>Don't forget to create the database <code>StackExchangeTagsDb_July2024</code> using SSMS before executing this snippet.</i></p> 

In [None]:
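import glob
import os

import pandas as pd
import pyodbc

# Connection settings: change the server name to your own SQL Server instance.
server = 'DESKTOP-DEK23E9'
database = 'StackExchangeTagsDb_July2024'

cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';')
cursor = cnxn.cursor()

# Folder holding one downloaded CSV file per site. A raw string takes single
# backslashes and cannot end with one; os.path.join adds the separator later.
tags_directory_path = r'E:\Dataset_2024\stackexchange-Tags-July2024\stackexchange-Tags-July2024'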

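# Process only the CSV files in the folder, one site per file.
for filename in glob.glob(os.path.join(tags_directory_path, '*.csv')):

    print(filename)
    # Derive the table name from the file name: drop spaces, the extension,
    # hyphens, and anything from the first "(" onwards.
    tail = os.path.basename(filename)
    tail = tail.replace(" ","")
    tail = tail.replace(".csv","")
    tail = tail.replace("-","")
    tail = tail.split("(")[0]

    print(tail)
    # Square brackets quote the identifier, so names starting with a digit still work.
    query="CREATE TABLE [{}] (AutoIdPK int Identity(1,1),Id int, TagName nvarchar(35), Count int, ExcerptPostId int,WikiPostId int, IsRequired bit,IsModeratorOnly bit)".format(tail)
    print(query)
    cursor.execute(query)
    print("Created table in database")

    # Load the CSV and replace missing values (empty cells) with 0.
    tags_df = pd.read_csv(filename)
    tags_df = tags_df.fillna(0)
    print(tags_df.head())

    # Build a parameterised INSERT with one "?" placeholder per CSV column.
    columnnames = list(tags_df.columns.values)
    columns = ','.join(columnnames)
    placeholders = ','.join('?' * len(columnnames))

    query = "insert into [{}] ({}) values ({})".format(tail, columns, placeholders)
    print(query)
    cursor.executemany(query, tags_df.values.tolist())
    print("Added in database")

# Commit all tables at once, then release the connection.
cnxn.commit()
cnxn.close()
print("Process completed")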
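<p>The file-name clean-up in the loop above can be isolated into a small helper, which makes it easy to try on a few names before touching the database. This is a minimal sketch: the function names <code>table_name_from_path</code> and <code>insert_placeholders</code>, the extra validity check and the sample file names are assumptions for illustration, not part of the original snippet.</p>

```python
import os
import re


def table_name_from_path(path):
    """Derive a table name from a CSV path using the same steps as the loop."""
    name = os.path.basename(path)
    name = name.replace(" ", "").replace(".csv", "").replace("-", "")
    name = name.split("(")[0]  # drop a trailing "(1)"-style suffix, if any
    # Assumed extra safeguard: the name is interpolated into SQL text, so
    # reject anything that is not a plain letters/digits/underscore name.
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError(f"Unsafe table name derived from {path!r}: {name!r}")
    return name


def insert_placeholders(column_count):
    """Build the '?,?,...' parameter list for an INSERT with that many columns."""
    return ",".join("?" * column_count)


# Hypothetical file names, used only to show the transformation.
for sample in ["Ask Ubuntu.csv", "super-user (1).csv"]:
    print(sample, "->", table_name_from_path(sample))
print(insert_placeholders(7))
```

<p>Keeping the name derivation in one function also gives a single place to reject file names that would produce an invalid or unsafe table name, since table names cannot be passed as query parameters.</p>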


