In [9]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
df = pd.read_csv("datasets/scores_dataset.csv")

In [4]:
reference_answers = {

    "Q1": (
        "Object-Oriented Programming (OOP) is a programming paradigm that organizes software "
        "design around objects that combine data and behavior. Unlike procedural programming, "
        "which focuses on functions operating on data, OOP models real-world entities as objects. "
        "The four core principles of OOP are encapsulation, abstraction, inheritance, and polymorphism. "
        "Encapsulation hides internal data and exposes it through controlled interfaces, abstraction "
        "focuses on essential features, inheritance enables code reuse through class hierarchies, and "
        "polymorphism allows different objects to respond differently to the same method. A real-world "
        "example of encapsulation is a bank account, where balance details are accessed only through "
        "methods like deposit and withdraw. OOP may not be ideal for small or performance-critical "
        "systems such as embedded software."
    ),

    "Q2": (
        "Arrays and linked lists are fundamental data structures used to store collections of elements. "
        "Arrays store elements in contiguous memory locations, allowing constant-time access using an "
        "index, but insertions and deletions are costly because elements may need to be shifted. Linked "
        "lists store elements in non-contiguous memory locations, where each node contains data and a "
        "reference to the next node, making access slower due to traversal but allowing efficient "
        "insertions and deletions. Arrays provide O(1) access and O(n) insertion and deletion in the "
        "worst case, while linked lists provide O(n) access and O(1) insertion or deletion when the "
        "position is known. A real-world example where linked lists are useful is a music playlist."
    ),

    "Q3": (
        "Time complexity measures how an algorithm’s running time grows as the input size increases, "
        "and it is important for comparing algorithms independently of hardware. Big-O notation "
        "expresses the upper bound of an algorithm’s growth rate. Linear time O(n) occurs when an "
        "algorithm processes each element once, logarithmic time O(log n) appears in algorithms like "
        "binary search, and quadratic time O(n^2) occurs in algorithms with nested loops such as "
        "bubble sort. Worst-case analysis is preferred because it guarantees performance limits under "
        "all possible input conditions."
    ),

    "Q4": (
        "Compiled programming languages translate source code into machine code before execution, "
        "producing an executable file, while interpreted languages execute code line by line at runtime "
        "using an interpreter. Examples of compiled languages include C and C++, while Python and "
        "JavaScript are commonly interpreted. Just-In-Time compilation, used in languages like Java, "
        "combines both approaches by compiling frequently executed code segments at runtime to improve "
        "performance. Interpreted languages are not always slower than compiled ones due to modern "
        "runtime optimizations."
    ),

    "Q5": (
        "A process is an independent program in execution with its own memory space, while a thread "
        "is a lightweight unit of execution within a process that shares memory with other threads. "
        "Context switching is the mechanism by which the operating system saves the state of a running "
        "task and restores another, allowing multitasking and efficient CPU utilization. Processes do "
        "not share memory by default, making them safer but heavier, whereas threads share memory, "
        "enabling faster communication but requiring synchronization. Multithreading improves "
        "performance in applications such as web servers handling multiple client requests."
    ),

    "Q6": (
        "Database normalization is the process of organizing data to reduce redundancy and improve "
        "integrity. First normal form ensures atomic values with no repeating groups. Second normal "
        "form removes partial dependencies so that non-key attributes depend on the entire primary key. "
        "Third normal form removes transitive dependencies so that non-key attributes depend only on "
        "the primary key. Excessive normalization can lead to complex queries and performance overhead "
        "due to multiple joins. Denormalization is often used in read-heavy systems such as analytics "
        "and reporting platforms."
    ),

    "Q7": (
        "Machine Learning enables systems to learn patterns from data rather than relying on manually "
        "written rules. Supervised learning uses labeled data such as spam email classification, "
        "unsupervised learning works with unlabeled data for tasks like customer segmentation, and "
        "reinforcement learning learns through rewards and penalties such as training agents to play "
        "games. Key challenges include data dependency, bias, lack of interpretability, and difficulty "
        "in generalizing to unseen data."
    ),

    "Q8": (
        "Artificial Intelligence is the field of creating systems capable of performing tasks that "
        "require human-like intelligence, such as reasoning and decision-making. Machine Learning is a "
        "subset of AI that allows systems to learn from data. Not all AI systems use Machine Learning; "
        "some rely on rule-based logic. Deep learning is a subset of Machine Learning that uses multi-"
        "layer neural networks. An example of non-ML AI is a rule-based expert system that uses "
        "predefined rules to make decisions."
    ),

    "Q9": (
        "When a user enters a URL, the browser resolves the domain name to an IP address using DNS. It "
        "then establishes a connection with the server using TCP through a three-way handshake. If "
        "HTTPS is used, a TLS handshake encrypts the communication. The browser sends a request, the "
        "server responds with encrypted data, and the browser decrypts and renders the webpage. HTTPS "
        "ensures secure communication."
    ),

    "Q10": (
        "The Software Development Lifecycle is a structured process that includes requirement analysis, "
        "design, implementation, testing, deployment, and maintenance. The waterfall model follows "
        "these phases sequentially and is suitable when requirements are stable. Agile development is "
        "iterative and incremental, emphasizing flexibility, continuous feedback, and frequent "
        "releases. Agile is preferred for dynamic projects, while waterfall suits regulated or "
        "contract-based environments."
    )

}


In [8]:
df.columns.tolist()

['question_id',
 'question',
 'answer',
 'ideal_score',
 'tfidf_cosine_score',
 'tfidf_score_100']

In [10]:
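# Word bigrams and trigrams only; English stop words are removed before the n-grams are built.
vectorizer = TfidfVectorizer(ngram_range=(2, 3), stop_words="english", lowercase=True)

scores = []
for _, row in df.iterrows():
    qid = row["question_id"]
    ans = row["answer"]
    ref_ans = reference_answers[qid]

    # Refit on each (reference, answer) pair, so IDF weights come from these two texts only.
    tfidf_matrix = vectorizer.fit_transform([ref_ans, ans])

    # cosine_similarity returns a 1x1 matrix; [0][0] extracts the scalar.
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]

    scores.append(similarity)

# Scale cosine similarity from [0, 1] to a 0-100 score.
df["tfidf_2_3_ngram_score"] = [score * 100 for score in scores]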
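
The per-pair scoring in the loop above can also be wrapped in a small function, which makes it easier to try on toy inputs. The sketch below is illustrative only: `ngram_tfidf_score` is a hypothetical helper name, and returning 0.0 for empty or non-string answers is an assumption about how such rows should be treated, not something the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def ngram_tfidf_score(reference, answer, ngram_range=(2, 3)):
    """Hypothetical helper: TF-IDF cosine similarity of one pair, scaled to 0-100."""
    # Assumption: a missing or blank answer scores 0 instead of raising.
    if not isinstance(answer, str) or not answer.strip():
        return 0.0
    vectorizer = TfidfVectorizer(
        ngram_range=ngram_range, stop_words="english", lowercase=True
    )
    try:
        matrix = vectorizer.fit_transform([reference, answer])
    except ValueError:
        # Raised when no n-grams survive tokenization and stop-word removal.
        return 0.0
    return float(cosine_similarity(matrix[0:1], matrix[1:2])[0][0]) * 100


reference = "Arrays store elements in contiguous memory locations."
print(ngram_tfidf_score(reference, reference))  # identical text
print(ngram_tfidf_score(reference, "Linked lists use pointers between nodes."))  # no shared n-grams
print(ngram_tfidf_score(reference, ""))  # blank answer
```

Because only bigrams and trigrams are counted, an answer that uses the same vocabulary in a different order can still score low, which is worth keeping in mind when comparing this column with `tfidf_score_100`.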


In [12]:
df.to_csv("scores_dataset.csv", index=False)