Weekend Task - String Programs - TF-IDF #40

akash-coded · 2022-02-24T16:52:25Z

Maintainer

TF-IDF

Term Frequency - Inverse Document Frequency

TF-IDF stands for Term Frequency - Inverse Document Frequency. It measures how important a word is to a document within a collection of documents: the weight rises with how often the word appears in the document and falls with how many documents contain it. The concept is simple to understand and to implement. Its use cases include weighting features for NLP tasks such as text classification, error detection in writing, document ranking, search engines and information retrieval, and keyword matching.

There are two elements in TF-IDF:

Term Frequency (TF) - Simply a measure of how frequently a word occurs in a document.
Inverse Document Frequency (IDF) - The inverse of the frequency of the word across a set of documents.

As with other types of data, before any processing it is important to clean or transform the text into a usable form. For instance, stop words and punctuation marks add little to the meaning of a document, so leaving them in distorts any measure of how similar two documents really are. If two documents have commas, question marks, and the pronoun “I” occurring at the same frequency, would these two documents be similar?
In text processing, we do this by:

Removing punctuation such as . , ! $ ( ) * % @
Lower casing
Tokenization
Stop word removal

Steps to clean the data

Punctuation Removal:
In this step, all punctuation is removed from the text. Python's string library provides a predefined list of punctuation characters (string.punctuation): !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Example:
Lowering the case
This is one of the most common preprocessing steps: the text is converted to a single case, preferably lower case.
Examples:
Tokenization
In this step, the text is split into smaller units. We can use either sentence tokenization or word tokenization based on our problem statement. For our case, we'll tokenize into words.
Examples:
Stop word removal
Stop words are commonly used words that carry little or no meaning on their own, so they are removed from the text because they add no value to the analysis. Some of the words considered stop words in English are: [i, me, my, myself, we, our, ours, ourselves, you, you’re, you’ve, you’ll, you’d, your, yours, yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don’t, should, should’ve, now, d, ll, m, o, re, ve, y, ain, aren’t, could, couldn’t, didn’t]
Examples:
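The four cleaning steps above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name `PreprocessSketch`, the method `preprocess`, and the shortened stop word set are hypothetical and not part of the task, and `\p{Punct}` (ASCII punctuation in Java regex) stands in for the punctuation list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PreprocessSketch {
    // Hypothetical, shortened stop word list used only for this illustration.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("i", "me", "my", "is", "a", "the", "of"));

    static List<String> preprocess(String text) {
        // 1. Punctuation removal: \p{Punct} matches the ASCII punctuation characters.
        String noPunctuation = text.replaceAll("\\p{Punct}", "");
        // 2. Lower casing (Locale.ROOT keeps the result independent of the system locale).
        String lower = noPunctuation.toLowerCase(Locale.ROOT);
        // 3. Tokenization into words, splitting on runs of whitespace.
        String[] tokens = lower.trim().split("\\s+");
        // 4. Stop word removal: whole-word lookup in the set, empty tokens dropped.
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ship, drive, starship, benatar]
        System.out.println(preprocess("The ship I drive, is, a Starship Benatar."));
    }
}
```

Looking stop words up in a set compares whole tokens, which avoids accidental matches that a substring search over one long string of stop words would produce.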

Calculation

After the text processing steps, the cleaned data is converted into numeric form using TF-IDF.

Calculating TF
Calculating IDF
Calculating TF-IDF
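The three quantities can be written out as follows. This is a sketch consistent with the description in this post (row-normalized counts for TF, the log of a document ratio for IDF). The symbols $f_{t,d}$ (count of term $t$ in document $d$), $N$ (total number of documents) and $n_t$ (number of documents containing $t$) are notation introduced here, and the base-10 logarithm is an assumption, since the post only says "log".

```latex
\begin{align}
\mathrm{tf}(t, d) &= \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \\
\mathrm{idf}(t) &= \log_{10} \frac{N}{n_t} \\
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\end{align}
```

A word that occurs in every document has $n_t = N$, so its IDF, and therefore its TF-IDF, is zero.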

A Step-by-Step Example

Let's look at an actual example. Say we have the following documents:

As we discussed earlier, we need to perform some pre-processing on the data. In our case we did the following: removed stop words, removed punctuation marks, and converted all the words to lowercase.

Then we tokenize. The unique words across all documents are name, naftal, car, hyundai, drive, sonata, and model. Let's count how often each word occurs in each document.

Let's normalize this data across the rows, so that the values for each document sum to one, to generate our final TF table.

Now let's calculate the IDF values. All we need to do is take the log of the ratio of the total number of documents to the number of documents in which the word occurs. Take the word “name”: it appears once in document one and in none of the other 3 documents. Its IDF will be:
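With four documents in total and “name” occurring in exactly one of them, the ratio is 4/1. The numeric value below assumes a base-10 logarithm, which the text does not state explicitly:

```latex
\mathrm{idf}(\text{name}) = \log_{10} \frac{4}{1} = \log_{10} 4 \approx 0.602
```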

The computed IDF values for the words are as follows:

The finalized TF-IDF table is as follows:

Note that the table above shows what we expected: words that occur frequently in a document but in few of the other documents have higher TF-IDF values for the documents in which they occur.
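The calculation walked through above can be sketched in Java as follows. This is a minimal illustration, not a reference solution: the class `TfIdfSketch`, its method names, and the two-document "apple/banana/cherry" corpus are hypothetical, the input is assumed to be already cleaned and tokenized, and base 10 is assumed for the logarithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

class TfIdfSketch {
    // Sorted list of all distinct tokens, so the matrix columns have a stable order.
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> words = new TreeSet<>();
        for (List<String> doc : docs) {
            words.addAll(doc);
        }
        return new ArrayList<>(words);
    }

    // Rows are documents, columns follow the order of vocab.
    static double[][] tfIdf(List<List<String>> docs, List<String> vocab) {
        int n = docs.size();
        double[][] matrix = new double[n][vocab.size()];
        for (int j = 0; j < vocab.size(); j++) {
            String word = vocab.get(j);
            int docsWithWord = 0;
            for (List<String> doc : docs) {
                if (doc.contains(word)) {
                    docsWithWord++;
                }
            }
            if (docsWithWord == 0) {
                continue; // word absent everywhere: leave the column at zero
            }
            // IDF: log of (number of documents / documents containing the word).
            double idf = Math.log10((double) n / docsWithWord);
            for (int i = 0; i < n; i++) {
                List<String> doc = docs.get(i);
                if (doc.isEmpty()) {
                    continue; // avoid dividing by a document length of zero
                }
                // TF: occurrences of the word divided by the document length.
                double tf = (double) Collections.frequency(doc, word) / doc.size();
                matrix[i][j] = tf * idf;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Hypothetical, already cleaned and tokenized documents.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("apple", "banana"),
                Arrays.asList("apple", "cherry"));
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (double[] row : tfIdf(docs, vocab)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In this toy corpus "apple" appears in both documents, so its weight is zero everywhere, while "banana" gets about 0.1505 in the first document (TF 0.5 times IDF of about 0.301).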

Task

Given four text files, write a program to construct the TF-IDF matrix for unique significant words.
First, read the text from the files programmatically, pre-process the text as per the aforementioned steps, and then construct and print the TF-IDF matrix.

The contents of the four text files are given as follows:

file1.txt

My name is Groot.

file2.txt

My ship is a Starship.

file3.txt

The ship I drive, is, a Starship Benatar.

file4.txt

My ship is a Benatar model of Starship !!
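Reading the four files programmatically could look like the sketch below. It assumes the files sit in the working directory under the names given above and are UTF-8 encoded; the class `ReadFilesSketch` and the method `readAll` are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class ReadFilesSketch {
    // Returns the full text of each file, in the same order as the given paths.
    static List<String> readAll(List<Path> paths) throws IOException {
        List<String> contents = new ArrayList<>();
        for (Path path : paths) {
            // Read the whole file at once and decode it as UTF-8 (assumed encoding).
            contents.add(new String(Files.readAllBytes(path), StandardCharsets.UTF_8));
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        List<Path> paths = new ArrayList<>();
        for (int i = 1; i <= 4; i++) {
            paths.add(Paths.get("file" + i + ".txt"));
        }
        for (String text : readAll(paths)) {
            System.out.println(text.trim());
        }
    }
}
```

Reading each file in full, rather than line by line, keeps multi-line files intact before the cleaning steps are applied.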

Sarfinaa · 2022-02-25T22:00:42Z


import java.util.*;
import java.io.*;
import java.util.stream.*;

public class TFIDF {
    public static String[] removePunctuation(StringBuilder[] sdata) {
        String str = "‘!”#$%&'()*+,-./:;?@[]^_`{|}~’";
        String[] ndata = new String[sdata.length];
        for (int i = 0; i < ndata.length; i++) {
            StringBuilder data = sdata[i];
            for (int j = data.length() - 1; j >= 0; j--) {
                if (str.contains(data.charAt(j) + "")) {
                    data.deleteCharAt(j);
                }
            }
            ndata[i] = data.toString();
        }
        return ndata;
    }

    public static void Lowercase(String[] str) {
        for (int i = 0; i < str.length; i++) {
            str[i] = str[i].toLowerCase();
        }
    }

    public static List<List<String>> tokenization(String[] data) {
        List<List<String>> list = new ArrayList<>();
        for (String str : data) {
            String[] arr = str.split(" ");
            list.add(new ArrayList<>(Arrays.asList(arr)));
        }
        return list;
    }

    public static void stopWordRemoval(List<List<String>> list) {
        String stopWord = "i, a, is, me, by, my, of, the, myself, we, our, ours, ourselves, you, you’re, you’ve, you’ll, you’d, your, yours, yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don’t, should, should’ve, now, d, ll, m, o, re, ve, y, ain, aren’t, could, couldn’t, didn’t, didn’t";
        // Compare whole words against a set. String.contains on the raw list would also drop
        // any token that is merely a substring of it (e.g. "the" inside "other").
        Set<String> stopWords = new HashSet<>(Arrays.asList(stopWord.split(", ")));
        for (int i = 0; i < list.size(); i++) {
            List<String> l = list.get(i).stream().filter(word -> !word.isEmpty() && !stopWords.contains(word)).collect(Collectors.toList());
            list.set(i, l);
        }
    }

    public static String[] getUniqueWords(List<List<String>> list) {
        Set<String> set = new HashSet<>();
        for (List<String> l : list) {
            for (String str : l) {
                set.add(str);
            }
        }
        return set.toArray(new String[set.size()]);
    }

    public static double countWords(String word, List<String> words) {
        double ans = 0;
        for (String str : words) {
            if (str.equals(word)) {
                ans++;
            }
        }
        return ans;
    }

    public static double[] generateTFAndIdf(double[][] matrix, String[] uniqueWords, List<List<String>> list) {
        double[] idf = new double[uniqueWords.length];
        for (int j = 0; j < matrix[0].length; j++) {
            String word = uniqueWords[j];
            double noOfTimesTOccured = 0;
            for (int i = 0; i < matrix.length; i++) {
                List<String> words = list.get(i);
                matrix[i][j] = countWords(word, words) / words.size();
                if (matrix[i][j] > 0) {
                    noOfTimesTOccured++;
                }
            }
            idf[j] = Math.log10(matrix.length / noOfTimesTOccured);

        }
        return idf;
    }

    public static void printMatrix(double[][] matrix, String[] uniqueWords) {
        System.out.print("Document| ");
        for (String word : uniqueWords)
            System.out.print(word + " ");
        System.out.println();
        for (int i = 0; i < matrix.length; i++) {
            System.out.print("file" + (i + 1) + "   | ");
            for (int j = 0; j < matrix[0].length; j++) {
                System.out.print(String.format("%.3f ", matrix[i][j]));
            }
            System.out.println();
        }
    }

    public static void calculateTFIDF(double[][] matrix, double[] idf) {
        for (int j = 0; j < matrix[0].length; j++) {
            for (int i = 0; i < matrix.length; i++) {
                matrix[i][j] *= idf[j];
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String[] files = { "file1.txt", "file2.txt", "file3.txt", "file4.txt" };
        StringBuilder[] data = new StringBuilder[files.length];

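        for (int i = 0; i < files.length; i++) {
            data[i] = new StringBuilder();
            // try-with-resources closes the Scanner; every line is appended
            // instead of overwriting data[i] with only the last line
            try (Scanner s = new Scanner(new File(files[i]))) {
                while (s.hasNextLine()) {
                    if (data[i].length() > 0) {
                        data[i].append(' ');
                    }
                    data[i].append(s.nextLine());
                }
            }
        }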
        String[] ndata = removePunctuation(data);
        Lowercase(ndata);
        List<List<String>> list = tokenization(ndata);
        stopWordRemoval(list);
        String[] uniqueWords = getUniqueWords(list);
        double[][] matrix = new double[files.length][uniqueWords.length];
        double[] idf = generateTFAndIdf(matrix, uniqueWords, list);
        calculateTFIDF(matrix, idf);
        printMatrix(matrix, uniqueWords);

    }

}


Ashwesh09 · 2022-02-27T15:54:53Z


package week4;

import java.io.*;
import java.lang.reflect.Array;
import java.text.DecimalFormat;
import java.util.*;

public class TFandIDFtextAnalysis {
    public static void writeInFile(String s1, String s2, String s3, String s4,String file1, String file2, String file3, String file4) {
        try (FileWriter fw1 = new FileWriter(file1);FileWriter fw2 = new FileWriter(file2);
            FileWriter fw3 = new FileWriter(file3);FileWriter fw4 = new FileWriter(file4);){
            fw1.write(s1);
            fw2.write(s2);
            fw3.write(s3);
            fw4.write(s4);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }

    public static String[] readFromFile(String file1, String file2, String file3, String file4) {
        try(FileReader fr1 = new FileReader(file1);
                FileReader fr2 = new FileReader(file2);
                FileReader fr3 = new FileReader(file3);
                FileReader fr4 = new FileReader(file4);) {
            String[] res = new String[4];
            int i = 0;
            for(String s : res) 
                res[i++] = "";
            int ch;
            while ((ch = fr1.read()) != -1)
                res[0] += (char) ch;
            while ((ch = fr2.read()) != -1)
                res[1] += (char) ch;
            while ((ch = fr3.read()) != -1)
                res[2] += (char) ch;
            while ((ch = fr4.read()) != -1)
                res[3] += (char) ch;
            return res;
            }
            catch (IOException e) {
                System.out.println(e.getMessage());
                return new String[0];
            }
    }

    public static List<String> filterString(String s) {
        String[] strArr = s.replaceAll("\\p{Punct}", "").toLowerCase().split("\\s");
        strArr = removeStopWord(strArr);
        return Arrays.asList(strArr);
    }

    public static String[] removeStopWord(String[] s) {
        String stopWords = "i, me, my, myself, we, our, ours, ourselves, you, you’re, you’ve, you’ll, you’d, your, yours, yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don’t, should, should’ve, now, d, ll, m, o, re, ve, y, ain, aren’t, could, couldn’t, didn’t, didn’t, the, a, by, is, of";
        String[] arr = stopWords.split(", ");
        for (String str1 : arr) {
            for (String str2 : s)
                if (str2.equals(str1)) {
                    s = remove(s, str1);
                }
        }
        return s;
    }

    public static String[] remove(String[] array, String element) {
    if (array.length > 0) {
        int index = -1;
        for (int i = 0; i < array.length; i++) {
            if (array[i].equals(element)) {
                index = i;
                break;
            }
        }
        if (index >= 0) {
            String[] copy = (String[]) Array.newInstance(array.getClass()
                    .getComponentType(), array.length - 1);
            if (copy.length > 0) {
                System.arraycopy(array, 0, copy, 0, index);
                System.arraycopy(array, index + 1, copy, index, copy.length - index);
            }
            return copy;
        }
    }
    return array;
}
    public static Set<String> getCommonWords(List<List<String>> input) {
        Set<String> res = new HashSet<>();
        for(List<String> s : input) {
             for (String x : s)
                    res.add(x);
        }
        return res;
    }

    public static double[][] wordFreqTable(Set<String> setOfCommonWords, List<List<String>> listofString) {
        double[][] tf = new double[4][setOfCommonWords.size()];
        List<String> commonWords = new ArrayList<>();
        commonWords.addAll(setOfCommonWords);
        String[][] strings = new String[4][setOfCommonWords.size()];
        int i = 0, j = 0;
        for (List<String> line : listofString)
            strings[i++] = line.toArray(new String[0]);
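        // Marks presence (1) or absence (0) of each word per document rather than counting repeats;
        // this equals the true count only while no word repeats within a document, as in the four task files.
        for (i = 0; i < 4; i++)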
            for (j = 0; j < setOfCommonWords.size(); j++) 
                if (Arrays.asList(strings[i]).contains(commonWords.get(j))) 
                    tf[i][j]++;
        return tf;
    }

    public static double[][] normalizedTFtable(double[][] tf, int row, int col) {
        double[] sumRow = new double[row];
        for (int i = 0; i < row; i++) {
            for (int j = 0; j < col; j++) {
                sumRow[i] = sumRow[i] + tf[i][j];
            }
        }
        for (int i = 0; i < row; i++) {
            for (int j = 0; j < col; j++) {
                tf[i][j] /= sumRow[i];
            }
        }
        return tf;
    }

    public static double[] getIDFandDisplayIDF(double[][]  tfAndIdf, int row, int col,Set<String> setOfCommonWords){
        double[] IDF = new double[col];
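        // Document frequency: count the documents in which the word occurs at all,
        // so the result stays correct even if the table holds counts greater than 1.
        for (int i = 0; i < col; i++) {
            for (int j = 0; j < row; j++) {
                if (tfAndIdf[j][i] > 0) {
                    IDF[i]++;
                }
            }
        }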
        DecimalFormat df = new DecimalFormat("#0.0000");
        System.out.print("IDF\t");
            for (String words : setOfCommonWords) {
                System.out.print(words + "\t");
            }
            System.out.print("\n\t");
            for (int i = 0; i < col ; i++) {
                IDF[i] = Math.log10(row / IDF[i]);
                System.out.print(df.format(IDF[i]) + "\t");
            }
            System.out.println("\n");
        return IDF;
    }

    public static void displayTable(double[][] tf, int n, int m, Set<String> setOfCommonWords) {
        DecimalFormat df = new DecimalFormat("#0.00000");
        System.out.print("D/W\t");
        for (String words : setOfCommonWords) {
            if(words.length() > 7)
                System.out.print(words + "\t");
            else    
                System.out.print(words + "\t\t");
        }
        System.out.println();
        for (int i = 0; i < n; i++) {
            System.out.print("D " + (i + 1) + ":\t");
            for (int j = 0; j < m; j++) {
                System.out.print(df.format(tf[i][j]) + "\t\t");
            }
            System.out.println();
        }
    }

    public static double[][] getTFandIDF(double[][] tfAndIdf, double[] idf, int row, int col) {
        for (int j = 0; j < col; j++) {
            for (int i = 0; i < row; i++)
                tfAndIdf[i][j] *= idf[j];
        }
        return tfAndIdf;
    }
    public static void main(String[] args) {
        String s1 = "My name is Groot.";
        String s2 = "My ship is a Starship.";
        String s3 = "The ship I drive, is, a Starship Benatar.";
        String s4 = "My ship is a Benatar model of Starship !!";
        String file1 = "week4/file1.txt";
        String file2 = "week4/file2.txt";
        String file3 = "week4/file3.txt";
        String file4 = "week4/file4.txt";
        /**
         * Write in File:
         */
        writeInFile(s1,s2,s3,s4,file1,file2,file3,file4);
        /**
         * Read from files:
         */
        String[] stotal = readFromFile(file1,file2,file3,file4);
        /**
         * Filter the strings:
         */
        List<List<String>> filteredString = new ArrayList<>();
        for(String s : stotal)
            filteredString.add(filterString(s));
        /**
         * Finding common words:
         */
        Set<String> setOfCommonWords = new HashSet<>();
        setOfCommonWords = getCommonWords(filteredString);
        /**
         * Frequency of occurrences:
         */
        double[][] tf = new double[4][setOfCommonWords.size()];
        tf = wordFreqTable(setOfCommonWords, filteredString);
        System.out.println("\n ---- Count table ---- \n");
        displayTable(tf, 4, setOfCommonWords.size(), setOfCommonWords);

        /**
         *  IDF and Normalize this data across the rows to sum it to one for each document to generate our final TF table.
         */
        System.out.println("\n ----  IDF Table -----\n");
        double[] idf = new double[setOfCommonWords.size()];
        idf = getIDFandDisplayIDF(tf, 4, setOfCommonWords.size(),setOfCommonWords);
        normalizedTFtable(tf, 4, setOfCommonWords.size());
        System.out.println("\n --- Normalized TF table ---- \n");
        displayTable(tf, 4, setOfCommonWords.size(), setOfCommonWords);
        /**
         * Final TF-IDF table:
         */
        getTFandIDF(tf, idf, 4, setOfCommonWords.size());
        System.out.println("\n------ FINAL TF and IDF Table: ------\n");
        displayTable(tf, 4, setOfCommonWords.size(), setOfCommonWords);
    }
}


lalit-ky · 2022-02-27T16:04:17Z


import java.io.File;
import java.io.FileNotFoundException;
import java.util.*;

public class TfIdf {
    public static String removePunctuation(String s) {
        return s.replace("[", "").replace("]", "").replaceAll("[!\"”#$%&'()*+,-./:;?@^_`{|}~]", "");
    }

    public static String[] removeStopWords(String s) {
        String[] stopWords = new String[] { "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "a",
                "you're", "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", "most", "other",
                "some", "such", "no", "nor", "not", "only", "own", "same", "so", "then", "too", "very", "is", "t",
                "can", "will", "just", "don", "don't", "should", "should've", "now", "d", "ll", "m", "o", "re", "ve",
                "y", "ain", "aren't", "could", "couldn't", "didn't", "didn't", "the", "of" };
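        // Split on runs of whitespace so repeated spaces do not produce empty tokens
        String[] tokens = s.trim().split("\\s+");
        String[] ans;
        int count = 0;
        for (int i = 0; i < tokens.length; i++) {
            for (String str : stopWords) {
                if (tokens[i].equals(str)) {
                    count++;
                    // stop at the first match: the list holds a duplicate entry,
                    // which would otherwise count one token twice
                    break;
                }
            }
        }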
        ans = new String[(tokens.length - count)];
        int j = 0;
        for (int i = 0; i < tokens.length; i++) {
            boolean flag = true;
            for (String str : stopWords) {
                if (tokens[i].equals(str)) {
                    flag = false;
                    break;
                }
            }
            if (flag) {
                ans[j] = tokens[i];
                j++;
            }
        }
        return ans;
    }

    public static String[][] cleanData(File[] f) {

        String[][] data = new String[f.length][];
        for (int i = 0; i < f.length; i++) {
            try (Scanner fReader = new Scanner(f[i])) {
                String str = "";
                while (fReader.hasNextLine()) {
                    str += fReader.nextLine() + " ";
                }

                str = removePunctuation(str);
                str = str.toLowerCase();
                data[i] = removeStopWords(str);

            } catch (FileNotFoundException e) {
                System.out.println(e.getMessage());
            }
        }
        return data;
    }

    public static String[] getUniqueWords(String[][] f) {
        Set<String> set = new HashSet<>();
        for (String[] file : f) {
            for (String str : file) {
                set.add(str);
            }
        }
        return set.toArray(new String[set.size()]);
    }

    public static void calculateTF(double[][] matrix, String[][] cleanFiles, String[] uniques) {
        for (int i = 0; i < cleanFiles.length; i++) {
            for (int j = 0; j < uniques.length; j++) {
                int count = 0;
                String word = uniques[j];
                for (int k = 0; k < cleanFiles[i].length; k++) {
                    if (word.equals(cleanFiles[i][k])) {
                        count++;
                    }
                }
                matrix[i][j] = (double) count / cleanFiles[i].length;
            }
        }
    }

    public static double[] calculateIDF(String[][] f, String[] unique) {
        double[] freq = new double[unique.length];
        int j = 0;

        for (String word : unique) {
            int count = 0;
            for (String[] file : f) {
                for (String wordinFile : file) {
                    if (wordinFile.equals(word)) {
                        count++;
                        break;
                    }
                }
            }
            freq[j] = count;
            j++;
        }
        int noOfDocuments = f.length;
        for (int i = 0; i < freq.length; i++) {
            freq[i] = Math.log10(noOfDocuments / freq[i]);
        }
        return freq;
    }

    public static void calculateTFIDF(double[][] matrix, double[] idf) {
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < idf.length; j++) {
                matrix[i][j] *= idf[j];
            }
        }
    }

    public static void printMatrix(double[][] matrix, String[] uniqueWords) {
        System.out.print("Document| ");
        for (String word : uniqueWords)
            System.out.print(word + "\t\t\t");
        System.out.println();
        for (int i = 0; i < matrix.length; i++) {
            System.out.print("file" + (i + 1) + "   | ");
            for (int j = 0; j < matrix[0].length; j++) {
                System.out.printf("%.5f\t\t\t", matrix[i][j]);
            }
            System.out.println();
        }
    }

    public static void main(String[] args) {
        File f1 = new File("file1.txt");
        File f2 = new File("file2.txt");
        File f3 = new File("file3.txt");
        File f4 = new File("file4.txt");
        File[] file = new File[] { f1, f2, f3, f4 };

        String[][] cleanFiles = cleanData(file);
        String[] unique = getUniqueWords(cleanFiles);

        double[][] matrix = new double[cleanFiles.length][unique.length];

        calculateTF(matrix, cleanFiles, unique);
        double[] idf = calculateIDF(cleanFiles, unique);

        calculateTFIDF(matrix, idf);

        printMatrix(matrix, unique);

    }
}


Snehil6197 · 2022-02-27T21:30:15Z


import java.io.IOException;
import java.util.*;
import java.io.File;

public class DocsOperation {
static String stemp = "";
static List<String> finalWordsList = new ArrayList<>();

static void matrix(String[][] ar) {
    Set<String> stringsList = new HashSet<String>(finalWordsList);
    List<String> hset = new ArrayList<>(stringsList);
    double[][] matrix = new double[4][hset.size()];
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < hset.size(); j++) {
            matrix[i][j] = 0;
        }
    }
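    for (int i = 0; i < 4; i++) {
        // Count occurrences first, then divide once by the document length;
        // dividing inside the counting loop skews the value for repeated words.
        for (int j = 0; j < ar[i].length; j++) {
            int index = hset.indexOf(ar[i][j]);
            matrix[i][index]++;
        }
        for (int j = 0; j < hset.size() && ar[i].length > 0; j++) {
            matrix[i][j] /= ar[i].length;
        }

    }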
    for (int i = 0; i < hset.size(); i++) {
        int l = 0;
        for (int j = 0; j < 4; j++) {
            if (matrix[j][i] != 0)
                l++;
        }
        // formula
        try {
            double temp =Math.log10(4 / (double) l);
            for (int j = 0; j < 4; j++) {
                matrix[j][i]*= temp;
            }
        } catch (Exception e) {
        }
        
    }
    for (int i = 0; i < hset.size(); i++) {
        for (int j = 0; j < 4; j++) {
           System.out.print(matrix[j][i]+" ");
        }
        System.out.println();
    }
}

// *************************************************FUNCTION********************************************************
public static List stopWordsRemove(List<String> wordsList) {
    String stopWord = "the, is, of, a, by, i, me, my, myself, we, our, ours, ourselves, you, you’re, you’ve, you’ll, you’d, your, yours, yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don’t, should, should’ve, now, d, ll, m, o, re, ve, y, ain, aren’t, could, couldn’t, didn’t, didn’t";
    stopWord = stopWord.trim().replaceAll(",", " ").replaceAll("( )+", " ");
    List<String> stopWordsList = new ArrayList<String>(Arrays.asList(stopWord.split(" ")));
    // removeAll drops every stop word in one pass. Removing by index inside nested loops
    // can skip the element that shifts into position i or read past the end of the list.
    wordsList.removeAll(stopWordsList);
    for (String i : wordsList) {
        finalWordsList.add(i);
    }
    return wordsList;
}

// *************************************************FUNCTION******************************************************************************************
public static void removePunctuationsLowercase(String s) {

    stemp += s.trim().replaceAll("[^a-zA-Z0-9]", " ").replaceAll("( )+", " ").toLowerCase();
}

// *************************************************FUNCTION*************************************

public static void main(String[] args) {
    String[] path = { "File1.txt", "File2.txt","File3.txt", "File4.txt" };
    String[][] ar = new String[4][];
    for (int i = 0; i < 4; i++) {
        File f = new File(path[i]);

        try (Scanner sc = new Scanner(f)) {
            while (sc.hasNextLine()) {
                String temp = sc.nextLine();
                removePunctuationsLowercase(temp);

            }
        } catch (IOException e) {
            System.out.println("exception occured");

        }
        List<String> wordsList = new ArrayList<String>(Arrays.asList(stemp.split(" ")));
        List<String> temp = stopWordsRemove(wordsList);
        ar[i] = new String[temp.size()];
        for (int k = 0; k < temp.size(); k++) {
            ar[i][k] = temp.get(k);
        }
        stemp = "";

    }
    matrix(ar);

}

}


Mrafe · 2022-03-01T15:53:07Z


package tasks;

import java.util.*;
import java.io.File;
import java.io.FileNotFoundException;

class Cleaning {
    public String punctuationRemoval(String s) {
        return s.replaceAll("[!\"”#$%&'()*+,-./:;?@^_`{|}~]", "");
    }

    public static String[] removeStopWords(String s) {
        String[] stopWords = new String[] { "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "a",
                "you're", "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", "most", "other",
                "some", "such", "no", "nor", "not", "only", "own", "same", "so", "then", "too", "very", "is", "t",
                "can", "will", "just", "don", "don't", "should", "should've", "now", "d", "ll", "m", "o", "re", "ve",
                "y", "ain", "aren't", "could", "couldn't", "didn't", "didn't", "the", "of" };
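        // Split on runs of whitespace so repeated spaces do not produce empty tokens
        String[] words = s.trim().split("\\s+");
        String[] rep;
        int count = 0;
        for (int i = 0; i < words.length; i++) {
            for (String str : stopWords) {
                if (words[i].equals(str)) {
                    count++;
                    // stop at the first match: the list holds a duplicate entry,
                    // which would otherwise count one word twice
                    break;
                }
            }
        }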
        rep = new String[(words.length - count)];
        int j = 0;
        for (int i = 0; i < words.length; i++) {
            boolean flag = true;
            for (String str : stopWords) {
                if (words[i].equals(str)) {
                    flag = false;
                    break;
                }
            }
            if (flag) {
                rep[j] = words[i];
                j++;
            }
        }
        return rep;
    }

    public String[][] cleanData(File[] f) {

        String[][] data = new String[f.length][];
        for (int i = 0; i < f.length; i++) {
            try (Scanner fReader = new Scanner(f[i])) {
                String str = "";
                while (fReader.hasNextLine()) {
                    str += fReader.nextLine() + " ";
                }
                str = str.toLowerCase();
                str = punctuationRemoval(str);
                data[i] = removeStopWords(str);

            } catch (FileNotFoundException e) {
                System.out.println(e.getMessage());
            }
        }
        return data;
    }

    public String[] getUniqueWords(String[][] f) {
        Set<String> set = new HashSet<>();
        for (String[] file : f) {
            for (String str : file) {
                set.add(str);
            }
        }
        return set.toArray(new String[set.size()]);
    }

}

class TdfIdfCalulations {
    public void tFcalc(double[][] result, String[][] cleanFiles, String[] uniques) {
        for (int i = 0; i < cleanFiles.length; i++) {
            for (int j = 0; j < uniques.length; j++) {
                int count = 0;
                String word = uniques[j];
                for (int k = 0; k < cleanFiles[i].length; k++) {
                    if (word.equals(cleanFiles[i][k])) {
                        count++;
                    }
                }
                result[i][j] = (double) count / cleanFiles[i].length;
            }
        }
    }

    public double[] calculateIDF(String[][] f, String[] unique) {
        double[] freq = new double[unique.length];
        int j = 0;

        for (String word : unique) {
            int count = 0;
            for (String[] file : f) {
                for (String wordinFile : file) {
                    if (wordinFile.equals(word)) {
                        count++;
                        break;
                    }
                }
            }
            freq[j] = count;
            j++;
        }
        int noOfDocuments = f.length;
        for (int i = 0; i < freq.length; i++) {
            freq[i] = Math.log10(noOfDocuments / freq[i]);
        }
        return freq;
    }

    void tFCalc(double[][] result, double[] idf) {
        for (int i = 0; i < result.length; i++) {
            for (int j = 0; j < idf.length; j++) {
                result[i][j] *= idf[j];
            }
        }
    }

    void print(double[][] result, String[] uniqueWords) {
        System.out.print("|\tDocument| ");
        for (String word : uniqueWords)
            System.out.print(word + "\t|\t");
        System.out.println();
        for (int i = 0; i < result.length; i++) {
            System.out.print("|\tfile" + (i + 1) + "   | ");
            for (int j = 0; j < result[0].length; j++) {
                System.out.printf("%.5f\t|\t", result[i][j]);
            }
            System.out.println();
        }
    }
}

class Main {
    public static void main(String[] args) {
        File file1 = new File("tasks/file1.txt");
        File file2 = new File("tasks/file2.txt");
        File file3 = new File("tasks/file3.txt");
        File file4 = new File("tasks/file4.txt");
        File[] fileArray = new File[] { file1, file2, file3, file4 };

        Cleaning clean = new Cleaning();

        String[][] cleanFiles = clean.cleanData(fileArray);
        String[] unique = clean.getUniqueWords(cleanFiles);

        double[][] result = new double[cleanFiles.length][unique.length];

        TdfIdfCalulations cal = new TdfIdfCalulations();
        cal.tFcalc(result, cleanFiles, unique);
        double[] idf = cal.calculateIDF(cleanFiles, unique);

        cal.tFCalc(result, idf);

        cal.print(result, unique);

    }
}


Harshita-Joshi · 2022-03-03T03:09:43Z

package TFIDF;
import java.io.FileReader;
import java.util.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class TFTask {

    // Reading from each file
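    static String readFromFile(String fileName){

        StringBuilder data = new StringBuilder();
        // try-with-resources closes the scanner even if reading fails
        try (Scanner scn = new Scanner(new FileReader(fileName))) {
            // Append every line so multi-line files are read in full
            while(scn.hasNextLine()){
                data.append(scn.nextLine()).append(' ');
            }
        } catch(Exception e){
            // Report the problem instead of silently discarding it
            System.out.println(e.getMessage());
        }
        return data.toString().trim();
    }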

    // Function for punctuation removal
    static String removePunctuation(String lineFromEachFile){
        
        String str = "”!#$%&'()*+,-./:;?@[]^_`{|}~";

        StringBuilder line = new StringBuilder(lineFromEachFile);
        
        for(int i=line.length()-1; i>=0  ; i--){
            char ch = line.charAt(i);
            String character = Character.toString(ch);
            if(str.contains(character)){
                line.deleteCharAt(i);
            }
        }
        return new String(line);
    }

    // Function to convert file data to lower case
    static String convertToLower(String lineFromEachFile){
        return lineFromEachFile.toLowerCase();
    }

    // Function to split the file into words
    static String[] tokenize(String lineFromEachFile){
         return lineFromEachFile.split(" ");
    }

    // Function to remove stop words
    static String[] removeStopWords(String[] wordArray){
        String stopWords = "i, is, me, my, myself, we, of, our, ours, ourselves, you, you’re, you’ve, you’ll, you’d, your, yours, yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don’t, should, should’ve, now, d, ll, m, o, re, ve, y, ain, aren’t, could, couldn’t, didn’t, didn’t";

        // Split the list into exact entries so a word is dropped only on a whole-word match
        // (a substring test would also drop words such as "on" or "an")
        Set<String> stopWordSet = new HashSet<String>(Arrays.asList(stopWords.split(", ")));

        List<String> wordArrayList = new ArrayList<String>();

        for(int i=0; i<wordArray.length; i++){
            // Skip stop words and the empty tokens left by repeated spaces
            if(!wordArray[i].isEmpty() && !stopWordSet.contains(wordArray[i])){
                wordArrayList.add(wordArray[i]);
            }
        }

        String[] wordArrayRes = new String[wordArrayList.size()];
        wordArrayRes =  wordArrayList.toArray(wordArrayRes);
        return wordArrayRes;
    }

    // Function to find unique words from each file
    static String[] findAllUniqueWords(String[] wordsInFile1, String[] wordsInFile2, String[] wordsInFile3, String[] wordsInFile4){
        List<String> uniqueWords = new ArrayList<String>();

        //Adding unique words from file 1
        for(int i=0; i<wordsInFile1.length; i++){
            if(!uniqueWords.contains(wordsInFile1[i])){
                uniqueWords.add(wordsInFile1[i]);
            }
        }

        //Adding unique words from file 2
        for(int i=0; i<wordsInFile2.length; i++){
            if(!uniqueWords.contains(wordsInFile2[i])){
                uniqueWords.add(wordsInFile2[i]);
            }
        }

        //Adding unique words from file 3
        for(int i=0; i<wordsInFile3.length; i++){
            if(!uniqueWords.contains(wordsInFile3[i])){
                uniqueWords.add(wordsInFile3[i]);
            }
        }

        //Adding unique words from file 4
        for(int i=0; i<wordsInFile4.length; i++){
            if(!uniqueWords.contains(wordsInFile4[i])){
                uniqueWords.add(wordsInFile4[i]);
            }
        }

        String[] allUniqueWords = new String[uniqueWords.size()];
        allUniqueWords = uniqueWords.toArray(allUniqueWords);
        return allUniqueWords;
    }

    // Function to generate TF array for each file
    static double[] generateTFForEachDocument(String[] wordsInEachFile){

        int totalWords = wordsInEachFile.length;
        double[] tfForEachFile = new double[totalWords];

        for(int i=0; i<totalWords; i++){
            tfForEachFile[i] = (double)1/totalWords;
        }

        return tfForEachFile;

    }

    // Function to generate IDF array for each unique word
    static double[] generateIDFForAllWords(String[] allUniqueWords, String[] wordsInFile1, String[] wordsInFile2, String[] wordsInFile3, String[] wordsInFile4){

        int[] wordOccurenceArray = new int[allUniqueWords.length];

        for(int i=0; i<allUniqueWords.length; i++){

            // Count each file at most once per word: IDF uses the number of
            // documents that contain the word, not its total occurrences
            for(String str: wordsInFile1){
                if(str.equals(allUniqueWords[i])){
                    wordOccurenceArray[i]++;
                    break;
                }
            }

            for(String str: wordsInFile2){
                if(str.equals(allUniqueWords[i])){
                    wordOccurenceArray[i]++;
                    break;
                }
            }

            for(String str: wordsInFile3){
                if(str.equals(allUniqueWords[i])){
                    wordOccurenceArray[i]++;
                    break;
                }
            }

            for(String str: wordsInFile4){
                if(str.equals(allUniqueWords[i])){
                    wordOccurenceArray[i]++;
                    break;
                }
            }

        }

        double[] IDFArray = new double[wordOccurenceArray.length];
        for(int i=0; i<IDFArray.length; i++){
            IDFArray[i] = Math.log10((double)4 / wordOccurenceArray[i]);
        }

        return IDFArray;

    }

    // Function to generate TF-IDF Matrix
    static double[][] generateTFIDFMatrix(String[] allUniqueWords, double[] IDFArray, double[] tfForFile1, double[] tfForFile2, double[] tfForFile3, double[] tfForFile4, String[] wordsInFile1, String[] wordsInFile2, String[] wordsInFile3, String[] wordsInFile4){

        double[][] TFIDFMat = new double[4][allUniqueWords.length];

        // Each position in a tf array holds 1/totalWords, so adding one term per
        // occurrence yields (count / totalWords) * IDF for repeated words

        // File 1
        for(int j=0; j<TFIDFMat[0].length; j++){
            String fromUniqueWords = allUniqueWords[j];
            for(int k = 0; k<wordsInFile1.length; k++){
                if(fromUniqueWords.equals(wordsInFile1[k])){
                    TFIDFMat[0][j] += IDFArray[j]*tfForFile1[k];
                }
            }
        }

        // File 2
        for(int j=0; j<TFIDFMat[1].length; j++){
            String fromUniqueWords = allUniqueWords[j];
            for(int k = 0; k<wordsInFile2.length; k++){
                if(fromUniqueWords.equals(wordsInFile2[k])){
                    TFIDFMat[1][j] += IDFArray[j]*tfForFile2[k];
                }
            }
        }

        // File 3
        for(int j=0; j<TFIDFMat[2].length; j++){
            String fromUniqueWords = allUniqueWords[j];
            for(int k = 0; k<wordsInFile3.length; k++){
                if(fromUniqueWords.equals(wordsInFile3[k])){
                    TFIDFMat[2][j] += IDFArray[j]*tfForFile3[k];
                }
            }
        }

        // File 4
        for(int j=0; j<TFIDFMat[3].length; j++){
            String fromUniqueWords = allUniqueWords[j];
            for(int k = 0; k<wordsInFile4.length; k++){
                if(fromUniqueWords.equals(wordsInFile4[k])){
                    TFIDFMat[3][j] += IDFArray[j]*tfForFile4[k];
                }
            }
        }

       
        return TFIDFMat;

    }

    // Function to display TF-IDF Matrix
    static void displayTFIDF(double[][] TFIDFMatrix, String[] allUniqueWords){

        System.out.print("Document  ");
        for(int i=0; i<allUniqueWords.length;i++){
            System.out.print(allUniqueWords[i]+"    ");
        }
        System.out.println();
        for(int i=0; i<TFIDFMatrix.length; i++){
            System.out.print(" D"+(i+1)+"       ");
            for(int j=0; j<TFIDFMatrix[0].length; j++){
                System.out.print(String.format("%.4f", TFIDFMatrix[i][j]));
                System.out.print("   ");
            }
            System.out.println();
        }
    }

    public static void main(String[] args) {
        
        String[] fileData = new String[4];      //for storing file data

        // Reading line from each file
        fileData[0] = readFromFile("TFIDF/file1.txt");
        fileData[1] = readFromFile("TFIDF/file2.txt");
        fileData[2] = readFromFile("TFIDF/file3.txt");
        fileData[3] = readFromFile("TFIDF/file4.txt");


        // Removing punctuation from each file
        for(int i=0; i<4; i++){
            fileData[i] =  removePunctuation(fileData[i]);
        }

        // Converting entire text to lowercase
        for(int i=0; i<4; i++){
            fileData[i] =  convertToLower(fileData[i]);
        }


        //Tokenization
        String[] wordsInFile1 = tokenize(fileData[0]);
        String[] wordsInFile2 = tokenize(fileData[1]);
        String[] wordsInFile3 = tokenize(fileData[2]);
        String[] wordsInFile4 = tokenize(fileData[3]);

        // Stop word removal
        wordsInFile1 = removeStopWords(wordsInFile1);
        wordsInFile2 = removeStopWords(wordsInFile2);
        wordsInFile3 = removeStopWords(wordsInFile3);
        wordsInFile4 = removeStopWords(wordsInFile4);

        // Finding unique words from all the files
        String[] allUniqueWords = findAllUniqueWords(wordsInFile1, wordsInFile2, wordsInFile3, wordsInFile4);
        
        // Generating TF array for each file
        double[] tfForFile1 = generateTFForEachDocument(wordsInFile1);
        double[] tfForFile2 = generateTFForEachDocument(wordsInFile2);
        double[] tfForFile3 = generateTFForEachDocument(wordsInFile3);
        double[] tfForFile4 = generateTFForEachDocument(wordsInFile4);


        // Generating IDF array for all words
        double[] IDFArray = generateIDFForAllWords(allUniqueWords, wordsInFile1, wordsInFile2, wordsInFile3, wordsInFile4);
        
        // Generating TF-IDF matrix
        double[][] TFIDFMatrix = generateTFIDFMatrix(allUniqueWords, IDFArray, tfForFile1, tfForFile2, tfForFile3, tfForFile4, wordsInFile1, wordsInFile2, wordsInFile3, wordsInFile4);

        // Displaying TF-IDF Matrix
        displayTFIDF(TFIDFMatrix, allUniqueWords);

    }
}
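
The submission above is written for exactly four files, with one parameter per file. As a rough sketch only (not the author's method), the same calculation can be written for any number of documents by keeping the tokens in a list of lists. The names `TfIdfSketch` and `tfIdf` are hypothetical and introduced here for illustration. The formula assumed is tf = count / document length and idf = log10(N / number of documents containing the word), which is what the other submissions in this thread use.

```java
import java.util.*;

// Hypothetical helper for illustration; not part of the submission above.
class TfIdfSketch {

    // Returns one map per document: word -> tf * idf.
    // Assumes every document is non-null and already cleaned and tokenized.
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: in how many documents each word appears
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> doc : docs) {
            // A HashSet copy ensures each document is counted once per word
            for (String word : new HashSet<>(doc)) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw occurrence counts within this document
            Map<String, Integer> counts = new HashMap<>();
            for (String word : doc) {
                counts.merge(word, 1, Integer::sum);
            }

            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                double tf = (double) entry.getValue() / doc.size();
                double idf = Math.log10((double) docs.size() / documentFrequency.get(entry.getKey()));
                scores.put(entry.getKey(), tf * idf);
            }
            result.add(scores);
        }
        return result;
    }
}
```

Calling `TfIdfSketch.tfIdf` with one token list per file would replace the per-file parameters; words missing from a document are simply absent from that document's map and can be treated as 0 when printing.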


preeti-max · 2022-03-03T05:16:01Z

package StringOperations;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

public class Tfidf {
    public static String punctuationRemoval(String content) {

        return content.replaceAll("[~!@#$%^&*()_+{}\\[\\]:;,.<>/?-]", "");

    }

    public static String[] removeStopWord(String content) {
        String stopwords[] = { "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "you’re", "you’ve",
                "you’ll", "you’d", "your", "yours", "yourself", "yourselves", "he", "most", "other", "some", "such",
                "no", "nor", "not", "only", "own", "same", "so", "then", "too", "very", "s", "t", "can", "will", "just",
                "don", "don’t", "should", "should’ve", "now", "d", "ll", "m", "o", "re", "ve", "y", "ain", "aren’t",
                "could", "couldn’t", "didn’t", "didn’t" };
        String modified[] = content.split(" ");

        List<String> a = new ArrayList<String>(Arrays.asList(modified));
        // Drop every token that exactly matches a stop word. removeAll avoids
        // removing by index inside a loop, which can skip elements or run past the end
        a.removeAll(Arrays.asList(stopwords));
        String[] b = a.toArray(new String[a.size()]);
        return b;

    }

    public static String[] getUniqueWords(String[][] content) {
        HashSet<String> set = new HashSet<>();
        for (String[] file : content) {
            for (String s : file) {
                set.add(s);
            }
        }
        return set.toArray(new String[set.size()]);
    }

    public static void tf(double[][] a, String[] s, String[][] content) {
        for (int i = 0; i < content.length; i++) {
            for (int j = 0; j < s.length; j++) {
                int count = 0;
                String word = s[j];
                for (int k = 0; k < content[i].length; k++) {
                    if (word.equals(content[i][k])) {
                        count++;
                    }
                }
                a[i][j] = (double) count / content[i].length;
            }
        }
    }

    public static double[] idf(String[][] content, String[] s) {
        double[] ans = new double[s.length];
        int j = 0;

        for (String word : s) {
            int count = 0;
            for (String[] file : content) {
                for (String i : file) {
                    if (i.equals(word)) {
                        count++;
                        break;
                    }
                }
            }
            ans[j] = count;
            j++;
        }
        int noOfDocuments = content.length;
        for (int i = 0; i < ans.length; i++) {
            ans[i] = Math.log10(noOfDocuments / ans[i]);
        }
        return ans;
    }

    public static void tfidf(double[][] matrix, double[] idf) {
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < idf.length; j++) {
                matrix[i][j] *= idf[j];
            }
        }
    }

    public static void printMatrix(double[][] matrix, String[] uniqueWords) {
        System.out.print("Document :");
        for (String word : uniqueWords)
            System.out.print(word + "  ");
        System.out.println();
        for (int i = 0; i < matrix.length; i++) {
            System.out.print("file" + (i + 1) + "  : ");
            for (int j = 0; j < matrix[0].length; j++) {
                System.out.printf("%.3f ", matrix[i][j]);
            }
            System.out.println();
        }
    }

    public static void main(String[] args) {
        String files[] = { "file1.txt", "file2.txt", "file3.txt", "file4.txt" };
        String content[][] = new String[files.length][];
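        for (int i = 0; i < files.length; i++) {
            try (Scanner sc = new Scanner(new File(files[i]))) {
                StringBuilder text = new StringBuilder();
                // Collect every line so multi-line files are processed in full
                while (sc.hasNextLine()) {
                    text.append(sc.nextLine()).append(' ');
                }
                String s = punctuationRemoval(text.toString().trim()).toLowerCase();
                content[i] = removeStopWord(s);
            } catch (FileNotFoundException e) {
                System.out.println(e);
            }
        }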
        String[] unique = getUniqueWords(content);

        double[][] matrix = new double[content.length][unique.length];

        tf(matrix, unique, content);
        double[] idf = idf(content, unique);

        tfidf(matrix, idf);
        System.out.println("TF-IDF MATRIX");
        printMatrix(matrix, unique);

    }

}
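
One point about `HashSet`-based vocabularies such as `getUniqueWords` above: `HashSet` does not guarantee any iteration order, so the column order of the printed matrix is unspecified. A minimal sketch, assuming alphabetical columns are acceptable, is to collect the words in a `TreeSet` instead. The names `VocabularyOrderSketch` and `sortedVocabulary` are hypothetical and used only for illustration.

```java
import java.util.*;

// Hypothetical helper for illustration; not part of the submission above.
class VocabularyOrderSketch {

    // Collects the distinct words of all documents in sorted order.
    // Assumes every row of docs is non-null.
    static String[] sortedVocabulary(String[][] docs) {
        // TreeSet keeps its elements in natural (alphabetical) order
        Set<String> vocabulary = new TreeSet<>();
        for (String[] doc : docs) {
            vocabulary.addAll(Arrays.asList(doc));
        }
        return vocabulary.toArray(new String[0]);
    }
}
```

If the order of first appearance is preferred over alphabetical order, `LinkedHashSet` could be swapped in for `TreeSet` without other changes.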


mkaifahmed · 2022-03-03T14:12:55Z

package Week4.Task;

import java.util.*;
import java.io.File;
import java.io.FileNotFoundException;

class Cleaning {
    public String punctuationRemoval(String s) {
        return s.replaceAll("[!\"”#$%&'()*+,-./:;?@^_`{|}~]", "");
    }

    public static String[] removeStopWords(String s) {
        String[] stopWords = new String[] { "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "a",
                "you're", "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", "most", "other",
                "some", "such", "no", "nor", "not", "only", "own", "same", "so", "then", "too", "very", "is", "t",
                "can", "will", "just", "don", "don't", "should", "should've", "now", "d", "ll", "m", "o", "re", "ve",
                "y", "ain", "aren't", "could", "couldn't", "didn't", "didn't", "the", "of" };
        String[] words = s.split(" ");
        String[] rep;
        int count = 0;
        for (int i = 0; i < words.length; i++) {
            for (String str : stopWords) {
                if (words[i].equals(str)) {
                    count++;
                    // Count each word once, even though the stop list repeats "didn't"
                    break;
                }
            }
        }
        rep = new String[(words.length - count)];
        int j = 0;
        for (int i = 0; i < words.length; i++) {
            boolean flag = true;
            for (String str : stopWords) {
                if (words[i].equals(str)) {
                    flag = false;
                    break;
                }
            }
            if (flag) {
                rep[j] = words[i];
                j++;
            }
        }
        return rep;
    }

    public String[][] cleanData(File[] f) {

        String[][] data = new String[f.length][];
        for (int i = 0; i < f.length; i++) {
            try (Scanner fReader = new Scanner(f[i])) {
                String str = "";
                while (fReader.hasNextLine()) {
                    str += fReader.nextLine() + " ";
                }
                str = str.toLowerCase();
                str = punctuationRemoval(str);
                data[i] = removeStopWords(str);

            } catch (FileNotFoundException e) {
                System.out.println(e.getMessage());
            }
        }
        return data;
    }

    public String[] getUniqueWords(String[][] f) {
        Set<String> set = new HashSet<>();
        for (String[] file : f) {
            for (String str : file) {
                set.add(str);
            }
        }
        return set.toArray(new String[set.size()]);
    }

}

class TdfIdfCalulations {
    public void tFcalc(double[][] result, String[][] cleanFiles, String[] uniques) {
        for (int i = 0; i < cleanFiles.length; i++) {
            for (int j = 0; j < uniques.length; j++) {
                int count = 0;
                String word = uniques[j];
                for (int k = 0; k < cleanFiles[i].length; k++) {
                    if (word.equals(cleanFiles[i][k])) {
                        count++;
                    }
                }
                result[i][j] = (double) count / cleanFiles[i].length;
            }
        }
    }

    public double[] calculateIDF(String[][] f, String[] unique) {
        double[] freq = new double[unique.length];
        int j = 0;

        for (String word : unique) {
            int count = 0;
            for (String[] file : f) {
                for (String wordinFile : file) {
                    if (wordinFile.equals(word)) {
                        count++;
                        break;
                    }
                }
            }
            freq[j] = count;
            j++;
        }
        int noOfDocuments = f.length;
        for (int i = 0; i < freq.length; i++) {
            freq[i] = Math.log10(noOfDocuments / freq[i]);
        }
        return freq;
    }

    void tFCalcc(double[][] result, double[] idf) {
        for (int i = 0; i < result.length; i++) {
            for (int j = 0; j < idf.length; j++) {
                result[i][j] *= idf[j];
            }
        }
    }

    void print(double[][] result, String[] uniqueWords) {
        System.out.print("|\tDocument| ");
        for (String word : uniqueWords)
            System.out.print(word + "\t|\t");
        System.out.println();
        for (int i = 0; i < result.length; i++) {
            System.out.print("|\tfile" + (i + 1) + "   | ");
            for (int j = 0; j < result[0].length; j++) {
                System.out.printf("%.5f\t|\t", result[i][j]);
            }
            System.out.println();
        }
    }
}

public class TFIDF {
    public static void main(String[] args) {
        File file1 = new File("core-java/Week4/Task/file1.txt");
        File file2 = new File("core-java/Week4/Task/file2.txt");
        File file3 = new File("core-java/Week4/Task/file3.txt");
        File file4 = new File("core-java/Week4/Task/file4.txt");
        File[] fileArray = new File[] { file1, file2, file3, file4 };

        Cleaning clean = new Cleaning();

        String[][] cleanFiles = clean.cleanData(fileArray);
        String[] unique = clean.getUniqueWords(cleanFiles);

        double[][] result = new double[cleanFiles.length][unique.length];

        TdfIdfCalulations cal = new TdfIdfCalulations();
        cal.tFcalc(result, cleanFiles, unique);
        double[] idf = cal.calculateIDF(cleanFiles, unique);

        cal.tFCalcc(result, idf);

        cal.print(result, unique);

    }
}


Khyati-tripathi · 2022-03-03T19:36:32Z

package IFIDFTaskModule;

import java.io.*;
import java.util.*;
class CleanText {
    double ansTable[][];

    String removepuctuation(String text) {
        return text.replaceAll("\\p{Punct}", "");
    }

    String changetoLowerCase(String text) {
        return text.toLowerCase();
    }

    String[] stopWordRemoval(String s) {
        String[] stopWords = new String[] { "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "a",
                "you're", "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", "most", "other",
                "some", "such", "no", "nor", "not", "only", "own", "same", "so", "then", "too", "very", "is", "t",
                "can", "will", "just", "don", "don't", "should", "should've", "now", "d", "ll", "m", "o", "re", "ve",
                "y", "ain", "aren't", "could", "couldn't", "didn't", "didn't", "the", "of" };
        String[] data = s.split(" ");
        int len = 0;
        for (int i = 0; i < data.length; i++) {
            for (String j : stopWords) {
                if (data[i].equals(j)) {
                    len++;
                    // Count each word once, even if the stop list has duplicate entries
                    break;
                }
            }
        }
        String[] clean = new String[data.length - len];
        int p = 0;
        for (String i : data) {
            boolean f = true;
            for (String j : stopWords) {
                if (i.equals(j)) {
                    f = false;
                    break;
                }
            }
            if (f)
                clean[p++] = i;
        }
        return clean;
    }

    String[] uniqueWords(String[][] list) {
        Set<String> set = new HashSet<>();
        for (String[] l : list) {
            for (String str : l) {
                set.add(str);
            }
        }
        return set.toArray(new String[set.size()]);
    }

    void getTF(String[] unique, String[][] list) {
        for (int i = 0; i < list.length; i++) {
            for (int j = 0; j < unique.length; j++) {
                int count = 0;
                String s = unique[j];
                for (int k = 0; k < list[i].length; k++) {
                    if (s.equals(list[i][k])) {
                        count++;
                    }
                }
                // Divide by the number of words in this document, not by the number of documents
                ansTable[i][j] = count / (double) list[i].length;

            }

        }

    }

    void getIDF(String[] unique, String[][] list) {
        getTF(unique, list);
        double[] idf = new double[unique.length];
        int p = 0;
        for (String i : unique) {
            int c = 0;
            for (String[] w : list) {
                for (String j : w) {

                    if (j.equals(i)) {
                        c++;
                        break;
                    }
                }
            }
            idf[p++] = c;
        }
        for (int i = 0; i < idf.length; i++) {
            idf[i] = Math.log10(list.length / idf[i]);
        }
        for (int j = 0; j < ansTable.length; j++) {
            for (int i = 0; i < idf.length; i++) {
                ansTable[j][i] *= idf[i];
            }
        }

    }

    void printMatrix(String[] cols) {
        System.out.print("Document:  ");
        for (String word : cols)
            System.out.print(word + "  ");
        System.out.println();
        for (int i = 0; i < ansTable.length; i++) {
            System.out.print("D" + (i + 1) + ":    ");
            for (int j = 0; j < ansTable[0].length; j++) {
                System.out.print(String.format("%.5f ", ansTable[i][j]));
            }
            System.out.println();
        }
    }
}

public class TfIdf {

    public static void main(String[] args) throws IOException {
        File[] files = new File[4];
        files[0] = new File("IFIDFTaskModule/file1.txt");
        files[1] = new File("IFIDFTaskModule/file2.txt");
        files[2] = new File("IFIDFTaskModule/file3.txt");
        files[3] = new File("IFIDFTaskModule/file4.txt");
        CleanText obj = new CleanText();

        String[][] data = new String[files.length][];
        for (int i = 0; i < files.length; i++) {
            try (Scanner sc = new Scanner(files[i])) {
                String str = "";
                while (sc.hasNextLine()) {
                    str += sc.nextLine() + " ";
                }
                str = obj.changetoLowerCase(str);
                str = obj.removepuctuation(str);
                data[i] = obj.stopWordRemoval(str);

            } catch (FileNotFoundException e) {
                System.out.println(e.getMessage());
            }
        }
        for (String[] i : data) {
            for (String j : i)
                System.out.println(j);
        }
        String[] uni = obj.uniqueWords(data);
        for (String i : uni)
            System.out.println(i);
        obj.ansTable = new double[data.length][uni.length];
        obj.getTF(uni, data);
        obj.getIDF(uni, data);
        obj.printMatrix(uni);

    }

}


RohitNaikade1 · 2022-03-05T17:24:03Z


import java.io.*;
import java.util.*;

public class TFIDF {

    static double count[][];
    static double tf[][];
    static double idf[];
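    public static String readfiles(String file) throws FileNotFoundException {
        StringBuilder str1 = new StringBuilder();
        // try-with-resources closes the scanner and the underlying reader
        try (Scanner s1 = new Scanner(new FileReader(file))) {
            while (s1.hasNextLine()) {
                // Keep a space between lines so words at line ends do not merge
                str1.append(s1.nextLine()).append(' ');
            }
        }

        return str1.toString().trim();
    }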

    public static String removePuncts(String str) {
        String res = str.replaceAll("\\p{Punct}", "").toLowerCase();
        return res;
    }

    public static String tolower(String str) {
        return str.toLowerCase();
    }

    public static void findTF(ArrayList<String> set, ArrayList<ArrayList<String>> docs) {
        tf = new double[4][set.size()];
        
        count = new double[4][set.size()];
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < set.size(); j++) {
                ArrayList<String> list = docs.get(i);
                int cnt = 0;
                for (String tmp : list) {
                    if (set.get(j).equals(tmp)) {
                        cnt++;
                    }
                    count[i][j] = cnt;
                }
            }
        }

        for (int i = 0; i < 4; i++) {
            // Term frequency = occurrences of the word / total words in the document
            int total = docs.get(i).size();
            for (int j = 0; j < set.size(); j++) {
                if (total > 0) {
                    tf[i][j] = count[i][j] / total;
                }
            }
        }
        System.out.println("TF Matrix is:");

        for (int i = 0; i < tf.length; i++) {
            for (int j = 0; j < tf[0].length; j++) {
                System.out.print(tf[i][j] + " ");
            }
            System.out.println();
        }
    }

    public static void main(String[] args) throws FileNotFoundException {

        // Reading files
        String str1 = readfiles("FileHandeling/file1.txt");
        String str2 = readfiles("FileHandeling/file2.txt");
        String str3 = readfiles("FileHandeling/file3.txt");
        String str4 = readfiles("FileHandeling/file4.txt");

        // Remove punctuations
        str1 = removePuncts(str1);
        str2 = removePuncts(str2);
        str3 = removePuncts(str3);
        str4 = removePuncts(str4);

        // convert to lower case
        str1 = tolower(str1);
        str2 = tolower(str2);
        str3 = tolower(str3);
        str4 = tolower(str4);

        // Tokenize the strings and remove stopwords
        ArrayList<String> first = tokenize(str1.split(" "));
        ArrayList<String> second = tokenize(str2.split(" "));
        ArrayList<String> third = tokenize(str3.split(" "));
        ArrayList<String> fourth = tokenize(str4.split(" "));

        Set<String> set = new HashSet<String>();
        set.addAll(first);
        set.addAll(second);
        set.addAll(third);
        set.addAll(fourth);
        ArrayList<String> temp = new ArrayList<String>();
        temp.addAll(set);

        ArrayList<ArrayList<String>> docs = new ArrayList<ArrayList<String>>();
        docs.add(first);
        docs.add(second);
        docs.add(third);
        docs.add(fourth);
        System.out.println(docs);
        findTF(temp, docs);
        findIDF(temp);

        // System.out.println("Count Matrix is:");

        // for (int i = 0; i < count.length; i++) {
        //     for (int j = 0; j < count[0].length; j++) {
        //         System.out.print(count[i][j] + " ");
        //     }
        //     System.out.println();
        // }
    }

    private static void findIDF(ArrayList<String> temp) {
        idf = new double[temp.size()];
        int k=0;

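        for(int i=0; i<temp.size(); i++) {
            int cnt=0;
            for(int j=0; j<4; j++){
                // A document contains the word if it occurs there at least once
                if(count[j][i] > 0){
                    cnt++;
                }
            }
           // Floating-point division; the integer expression 4/cnt would truncate
           idf[k]=Math.log(4.0/cnt);k++;
        }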

        System.out.println("IDF Matrix: ");
        for(int i=0; i<idf.length;i++){
            System.out.print(temp.get(i)+" ");
        }
        System.out.println("");
        for(int i=0; i<idf.length;i++){
            System.out.print(idf[i]+" ");
        }
    }

    private static ArrayList<String> tokenize(String str1[]) {
        ArrayList<String> stopwords = new ArrayList<String>(Arrays.asList("i", "is", "a", "me", "my", "myself", "we",
                "our",
                "ours", "ourselves", "you", "you’re", "you’ve", "you’ll", "you’d", "your", "yours", "yourself",
                "yourselves", "he", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so",
                "then", "too", "very", "s", "t", "can", "will", "just", "don", "don’t", "should", "should’ve", "now",
                "d", "ll", "m", "o", "re", "ve", "y", "ain", "aren’t", "could", "couldn’t", "didn’t", "didn’t", "of"));
        ArrayList<String> res = new ArrayList<>();
        for (String str : str1) {
            if (!stopwords.contains(str) && str.length() > 1) {
                res.add(str);
            }
        }
        return res;
    }
}
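
A small arithmetic note on IDF in Java: when both operands are `int`, the division is truncated before the logarithm is taken, so 4 documents with a word in 3 of them gives `4 / 3 == 1` and an IDF of 0. The sketch below contrasts the two forms. The class and method names are hypothetical, and `Math.log10` is assumed here only because most submissions in this thread use base 10.

```java
// Hypothetical helper for illustration; not part of the submission above.
class IdfDivisionSketch {

    // Integer division: the quotient is truncated before the logarithm
    static double idfIntDivision(int totalDocs, int docsWithWord) {
        return Math.log10(totalDocs / docsWithWord);
    }

    // Casting one operand to double keeps the fractional part of the quotient
    static double idfDoubleDivision(int totalDocs, int docsWithWord) {
        return Math.log10((double) totalDocs / docsWithWord);
    }
}
```

For 4 documents and a word found in 3 of them, the first method returns 0.0 while the second returns log10(4/3), roughly 0.125. Both assume `docsWithWord` is at least 1.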


ratnadeepikavuddagiri · 2022-03-06T02:07:43Z

package FileHandling.TF_IDF;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.*;

public class TermFrequencyInverseDocumentFrequency {

    public static String[] stopWordRemoval(ArrayList<String> arrOfStr) {
        String[] stopWords = new String[] { "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", 
                "you're", "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", "most", "other",
                "some", "such", "no", "nor", "not", "only", "own", "same", "so", "then", "too", "very", "is", "t",
                "can", "will", "just", "don", "don't", "should", "should've", "now", "d", "ll", "m", "o", "re", "ve",
                "y", "ain", "aren't", "could", "couldn't", "didn't", "didn't", "the", "of","a" };
        
        // Drop every token that exactly matches a stop word. removeAll avoids
        // removing by index inside a loop, which can skip elements or run past the end
        arrOfStr.removeAll(Arrays.asList(stopWords));
        String cleanData[] = arrOfStr.toArray(new String[arrOfStr.size()]);
        // for(int i=0;i<cleanData.length;i++)
        // System.out.println(cleanData[i]);
        return cleanData;
    }

    static String[] dataCleaning(String data){
        
            data = data.replaceAll("\\p{Punct}","").toLowerCase();
            ArrayList<String> arrOfStr = new ArrayList<>();
            String[] dataclean =data.split(" ");
            for(int i=0;i<dataclean.length;i++){
                arrOfStr.add(dataclean[i]);
            }
            String[] cleanData = stopWordRemoval(arrOfStr);
            // for(int i=0;i<arrOfStr.size();i++){
            //     System.out.println(arrOfStr.get(i));
            // }
        
        return cleanData;
    }
    
    public static String[] getUniqueWords(String[][] cleanData){
        HashSet<String> uniqueWords = new HashSet<>();
        for (String[] row : cleanData) {
            for (String word : row) {
                uniqueWords.add(word);
            }
        }
        return uniqueWords.toArray(new String[uniqueWords.size()]);
    }

    public static void getTFMatrix(double[][] tfMatrix,String[][] cleanData, String[] uniqueWords){
        for (int i = 0; i < cleanData.length; i++) {
            for (int j = 0; j < uniqueWords.length; j++) {
                int count = 0;
                for (int k = 0; k < cleanData[i].length; k++) {
                    if (uniqueWords[j].equals(cleanData[i][k])) {
                        count++;
                    }
                }
                tfMatrix[i][j] = (double) count / cleanData[i].length;
            }
        }
    } 
    public static void getIDF(double[] idf,String[][] cleanData, String[] uniqueWords){

        int k=0;
        for (String uniqueWord : uniqueWords) {
            int count = 0;
            for (String[] row : cleanData) {
                for (String word : row) {
                    if (word.equals(uniqueWord)) {
                        count++;
                        break;
                    }
                }
            }
            idf[k++] = count;
        }
        int noOfDocuments = cleanData.length;
        for (int i = 0; i < idf.length; i++) {
            idf[i] = Math.log10(noOfDocuments / idf[i]);
        }
    }
    public static void getTF_IDF(double[][] tfidf,double[][] tf, double[] idf){
        for (int i = 0; i < tf.length; i++) {
            for (int j = 0; j < idf.length; j++) {
                tfidf[i][j] = tf[i][j]*idf[j];
            }
        }
    }

 public static void main(String[] args) {

    String[] paths = {"FileHandling/TF_IDF/file1.txt","FileHandling/TF_IDF/file2.txt","FileHandling/TF_IDF/file3.txt","FileHandling/TF_IDF/file4.txt"};
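    String[][] cleanData = new String[paths.length][];
    try {
        int i=0;
        for(String x : paths){
            File file = new File(x);
            // Read the whole file, not just its last line, and close the Scanner when done.
            StringBuilder text = new StringBuilder();
            try (Scanner sc = new Scanner(file)) {
                while (sc.hasNextLine())
                    text.append(sc.nextLine()).append(" ");
            }
            cleanData[i]=dataCleaning(text.toString());
            i++;
        }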
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    // String[][] cleanData =dataCleaning(data);
    // for(int i=0;i<cleanData.length;i++){
    //     for(int j=0;j<cleanData[i].length;j++)
    //         System.out.print(cleanData[i][j]+" ");
    //         System.out.println();
    // }
    String[] uniqueWords = getUniqueWords(cleanData);
    double[][] tfMatrix = new double[paths.length][uniqueWords.length];
    getTFMatrix(tfMatrix, cleanData,uniqueWords);
    // for(int i=0;i<tfMatrix.length;i++){
    //     for(int j=0;j<tfMatrix[0].length;j++)
    //         System.out.print(tfMatrix[i][j]+" ");
    //         System.out.println();    
    // }
    double[] idf = new double[uniqueWords.length];
    getIDF(idf,cleanData,uniqueWords);
    // for(int i=0;i<idf.length;i++){
    //     System.out.print(idf[i]+" ");
    // }
    double[][] tfidf = new double[paths.length][uniqueWords.length];
    getTF_IDF(tfidf,tfMatrix,idf);
    // for(int i=0;i<tfidf.length;i++){
    //         for(int j=0;j<tfidf[0].length;j++)
    //             System.out.print(tfidf[i][j]+" ");
    //             System.out.println();    
    //     }
    System.out.print("Document ");
    for(int i=0;i<uniqueWords.length;i++){
        System.out.print(uniqueWords[i]+"  ");
    }
    System.out.println();
    for(int i=0;i<tfidf.length;i++){
        System.out.print("D"+(i+1)+"\t");
        for(int j=0;j<tfidf[0].length;j++)
            System.out.printf(" %.4f ",tfidf[i][j]);
        System.out.println();
    }
 }   
}
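
A note on removing stop words from a list in place: calling remove(i) inside an index-based loop shifts the later elements left, so the token that follows a removed word is never checked. Below is a minimal sketch of one way around this, assuming Java 8 or later. StopWordFilter and its shortened stop-word list are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not taken from the posts in this thread.
class StopWordFilter {
    // Shortened stop-word list, just for the demo.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "of", "a", "is"));

    // Removes stop words and empty tokens in place; removeIf takes care of index shifting.
    static List<String> removeStopWords(List<String> tokens) {
        tokens.removeIf(t -> t.isEmpty() || STOP_WORDS.contains(t));
        return tokens;
    }

    public static void main(String[] args) {
        // Two adjacent stop words: a remove(i) loop that always increments i would skip "of".
        List<String> tokens = new ArrayList<>(Arrays.asList("the", "of", "car", "is", "driven"));
        System.out.println(removeStopWords(tokens)); // prints [car, driven]
    }
}
```

The set lookup also avoids the nested loop over the stop-word array for every token.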

PavanArvapally · 2022-03-06T15:17:52Z

package tf_idf;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.*;

public class TfIdf {
    public static void main(String[] args) {
        File file1 = new File("Core-Java/tf_idf/file1.txt");
        File file2 = new File("Core-Java/tf_idf/file2.txt");
        File file3 = new File("Core-Java/tf_idf/file3.txt");
        File file4 = new File("Core-Java/tf_idf/file4.txt");
        ArrayList<String> str1 = cleanData(file1);
        ArrayList<String> str2 = cleanData(file2);
        ArrayList<String> str3 = cleanData(file3);
        ArrayList<String> str4 = cleanData(file4);
        List<List<String>> list = new ArrayList<>();
        list.add(str1);
        list.add(str2);
        list.add(str3);
        list.add(str4);

        Set<String> set = new HashSet<>();
        set.addAll(str1);
        set.addAll(str2);
        set.addAll(str3);
        set.addAll(str4);

        printTfIdf(list, set);
    }

    private static void printTfIdf(List<List<String>> list, Set<String> set) {
        double[][] tf = new double[list.size()][set.size()];
        List<Double> idf = new ArrayList<>();
        List<String> data = new ArrayList<>(set);
        for (int i = 0; i < data.size(); i++) {
            double count = 0;
            for (int j = 0; j < list.size(); j++) {
                double frequency = Collections.frequency(list.get(j), data.get(i));
                tf[j][i] = frequency / list.get(j).size();
                // IDF needs the number of documents containing the term, not its total occurrences.
                if (frequency > 0) {
                    count++;
                }
            }

            idf.add(Math.log10(list.size() / count));
        }
        
        for (int i = 0; i < tf.length; i++) {
            for (int j = 0; j < tf[i].length; j++) {
                tf[i][j] *= idf.get(j);
                System.out.printf("%.3f\t", tf[i][j]);
            }
            System.out.println();
        }

    }

    private static ArrayList<String> cleanData(File file1) {
        String data = "";
        try (Scanner sc = new Scanner(file1)) {
            while (sc.hasNextLine()) {
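                // Keep a space between lines so the last word of one line does not fuse with the first word of the next.
                data += sc.nextLine() + " ";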
            }
        } catch (FileNotFoundException e) {
            System.out.println(e);
        }
        data = data.replaceAll("\\p{Punct}", "");
        data = data.toLowerCase();

        return stopWordRemoval(data.split(" "));
    }

    private static ArrayList<String> stopWordRemoval(String[] words) {
        ArrayList<String> stopWords = new ArrayList<>(Arrays.asList("i", "is", "a", "me", "my", "myself", "we", "our",
                "ours", "ourselves", "you", "you’re", "you’ve", "you’ll", "you’d", "your", "yours", "yourself",
                "yourselves", "he", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so",
                "then", "too", "very", "s", "t", "can", "will", "just", "don", "don’t", "should", "should’ve", "now",
                "d", "ll", "m", "o", "re", "ve", "y", "ain", "aren’t", "could", "couldn’t", "didn’t", "didn’t", "of",
                "the", "by"));

        ArrayList<String> data = new ArrayList<>();

        for (String word : words) {
            if (!(stopWords.contains(word))) {
                data.add(word);
            }
        }
        return data;
    }
}
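
For reference, IDF is normally based on how many documents contain a term, not on how many times the term occurs in total. Below is a minimal sketch, assuming the log10(N / df) form used by the solutions in this thread. IdfSketch and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not taken from the posts in this thread.
class IdfSketch {
    // Number of documents that contain the term at least once.
    static int documentFrequency(List<List<String>> docs, String term) {
        int df = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return df;
    }

    // log10(N / df); the term is assumed to occur in at least one document.
    static double idf(List<List<String>> docs, String term) {
        return Math.log10((double) docs.size() / documentFrequency(docs, term));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("car", "driven", "road", "car"),
                Arrays.asList("truck", "driven", "highway"));
        System.out.println(idf(docs, "car"));    // log10(2 / 1): "car" is in one of two documents
        System.out.println(idf(docs, "driven")); // log10(2 / 2): a term found in every document scores 0.0
    }
}
```

Note that "car" occurs twice in the first document but still counts once towards the document frequency.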

Gaurav-Joshi-31 · 2022-03-06T16:04:53Z

import java.util.*;
import java.io.*;
import java.util.stream.*;

public class TFIDF {
    public static String[] deletePunctuation(StringBuilder[] sdata) {
        String str = "‘!”#$%&'()*+,-./:;?@[]^_`{|}~’";
        String[] ndata = new String[sdata.length];
        for (int i = 0; i < ndata.length; i++) {
            StringBuilder data = sdata[i];
            for (int j = data.length() - 1; j >= 0; j--) {
                if (str.contains(data.charAt(j) + "")) {
                    data.deleteCharAt(j);
                }
            }
            ndata[i] = data.toString();
        }
        return ndata;
    }

    public static void convertToLowercase(String[] str) {
        for (int i = 0; i < str.length; i++) {
            str[i] = str[i].toLowerCase();
        }
    }

    public static List<List<String>> tokenize(String[] data) {
        List<List<String>> list = new ArrayList<>();
        for (String str : data) {
            String[] arr = str.split(" ");
            list.add(new ArrayList<>(Arrays.asList(arr)));
        }
        return list;
    }

    public static void removeStopWord(List<List<String>> list) {
        String stopWord = "i, a, is, me, by, my, of, myself, we, our, ours, ourselves, you, you’re, you’ve, you’ll, you’d, your, yours, yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don’t, should, should’ve, now, d, ll, m, o, re, ve, y, ain, aren’t, could, couldn’t, didn’t, didn’t";
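        // Match whole tokens against a set; String.contains would also drop any word
        // that is merely a substring of the list (e.g. "an" inside "can").
        Set<String> stopWords = new HashSet<>(Arrays.asList(stopWord.split(", ")));
        for (int i = 0; i < list.size(); i++) {
            List<String> l = list.get(i).stream().filter(word -> !word.isEmpty() && !stopWords.contains(word)).collect(Collectors.toList());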
            list.set(i, l);
        }
    }

    public static String[] getUniqueWords(List<List<String>> list) {
        Set<String> set = new HashSet<>();
        for (List<String> l : list) {
            for (String str : l) {
                set.add(str);
            }
        }
        return set.toArray(new String[set.size()]);
    }

    public static double countWords(String word, List<String> words) {
        double ans = 0;
        for (String str : words) {
            if (str.equals(word)) {
                ans++;
            }
        }
        return ans;
    }

    public static double[] generateTFAndIdf(double[][] matrix, String[] uniqueWords, List<List<String>> list) {
        double[] idf = new double[uniqueWords.length];
        for (int j = 0; j < matrix[0].length; j++) {
            String word = uniqueWords[j];
            double noOfTimesTOccured = 0;
            for (int i = 0; i < matrix.length; i++) {
                List<String> words = list.get(i);
                matrix[i][j] = countWords(word, words) / words.size();
                if (matrix[i][j] > 0) {
                    noOfTimesTOccured++;
                }
            }
            idf[j] = Math.log10(matrix.length / noOfTimesTOccured);

        }
        return idf;
    }

    public static void printMatrix(double[][] matrix, String[] uniqueWords) {
        System.out.print("Document| ");
        for (String word : uniqueWords)
            System.out.print(word + " ");
        System.out.println();
        for (int i = 0; i < matrix.length; i++) {
            System.out.print("file" + (i + 1) + "   | ");
            for (int j = 0; j < matrix[0].length; j++) {
                System.out.print(String.format("%.3f ", matrix[i][j]));
            }
            System.out.println();
        }
    }

    public static void calculateTFIDF(double[][] matrix, double[] idf) {
        for (int j = 0; j < matrix[0].length; j++) {
            for (int i = 0; i < matrix.length; i++) {
                matrix[i][j] *= idf[j];
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String[] files = { "file1.txt", "file2.txt", "file3.txt", "file4.txt" };
        StringBuilder[] data = new StringBuilder[files.length];

        for (int i = 0; i < files.length; i++) {
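            // Append every line (the original kept only the last one) and close the Scanner afterwards.
            data[i] = new StringBuilder();
            try (Scanner s = new Scanner(new File(files[i]))) {
                while (s.hasNextLine()) {
                    data[i].append(s.nextLine()).append(' ');
                }
            }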
        }
        String[] dataWithoutPunctuation = deletePunctuation(data);
        convertToLowercase(dataWithoutPunctuation);
        List<List<String>> list = tokenize(dataWithoutPunctuation);
        removeStopWord(list);
        String[] uniqueWords = getUniqueWords(list);
        double[][] matrix = new double[files.length][uniqueWords.length];
        double[] idf = generateTFAndIdf(matrix, uniqueWords, list);
        calculateTFIDF(matrix, idf);
        printMatrix(matrix, uniqueWords);

    }

}
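
A note on checking stop words against a single comma-separated string: String.contains is a substring test, so any token that happens to appear inside a longer stop word is treated as a stop word too. Below is a minimal sketch of the difference. StopWordMatch and the shortened list are hypothetical and only illustrate the idea.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not taken from the posts in this thread.
class StopWordMatch {
    // Shortened list, just for the demo.
    static final String STOP_WORD_TEXT = "i, a, is, me, other, only, can";
    static final Set<String> STOP_WORD_SET =
            new HashSet<>(Arrays.asList(STOP_WORD_TEXT.split(", ")));

    // Substring test: true whenever the word occurs anywhere in the text.
    static boolean isStopWordBySubstring(String word) {
        return STOP_WORD_TEXT.contains(word);
    }

    // Whole-token test: true only for exact entries of the list.
    static boolean isStopWord(String word) {
        return STOP_WORD_SET.contains(word);
    }

    public static void main(String[] args) {
        System.out.println(isStopWordBySubstring("her")); // prints true ("her" sits inside "other")
        System.out.println(isStopWord("her"));            // prints false
        System.out.println(isStopWord("other"));          // prints true
    }
}
```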

Shubhi27 · 2022-03-06T16:40:25Z

package tf_Idf_Task;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

class DataCleaning {
    public static String removePunctuation(String s) {
        return s.replace("[", "").replace("]", "").replaceAll("[!\"”#$%&'()*+,-./:;?@^_`{|}~]", "");
    }

    public static String convertLowerCase(String s) {
        return s.toLowerCase();
    }

    public static String[] tokenize(String s) {
        return s.split(" ");
    }

    public static String[] removeStopWords(String s) {
        String[] stopWords = new String[] { "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "a",
                "you're", "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", "most", "other",
                "some", "such", "no", "nor", "not", "only", "own", "same", "so", "then", "too", "very", "is", "t",
                "can", "will", "just", "don", "don't", "should", "should've", "now", "d", "ll", "m", "o", "re", "ve",
                "y", "ain", "aren't", "could", "couldn't", "didn't", "didn't", "the", "of" };
        String[] data = s.split(" ");
        int len = 0;
        for (int i = 0; i < data.length; i++) {
            for (String str : stopWords) {
                if (data[i].equals(str)) {
                    len++;
                    break; // count each token once, even if a stop word is listed twice
                }
            }
        }
        String[] clean = new String[data.length - len];
        int count = 0;
        for (String i : data) {
            boolean flag = true;
            for (String str : stopWords) {
                if (i.equals(str)) {
                    flag = false;
                    break;
                }
            }
            if (flag)
                clean[count++] = i;
        }
        return clean;
    }

    public String[] uniqueWords(String[][] list) {
        Set<String> myset = new HashSet<>();
        for (String[] l : list) {
            for (String str : l) {
                myset.add(str);
            }
        }
        return myset.toArray(new String[myset.size()]);
    }

    public static void calculateTF(double[][] arr, String[][] files, String[] str) {
        for (int i = 0; i < files.length; i++) {
            for (int j = 0; j < str.length; j++) {
                int count = 0;
                String word = str[j];
                for (int k = 0; k < files[i].length; k++) {
                    if (word.equals(files[i][k])) {
                        count++;
                    }
                }
                arr[i][j] = (double) count / files[i].length;
            }
        }
    }

    public static double[] calculateIDF(String[][] files, String[] str) {
        double[] res = new double[str.length];
        int j = 0;

        for (String word : str) {
            int count = 0;
            for (String[] file : files) {
                for (String s : file) {
                    if (s.equals(word)) {
                        count++;
                        break;
                    }
                }
            }
            res[j] = count;
            j++;
        }
        int noOfDocuments = files.length;
        for (int i = 0; i < res.length; i++) {
            res[i] = Math.log10(noOfDocuments / res[i]);
        }
        return res;
    }

    public static void calculateTfIdf(double[][] arr, double[] idf) {
        for (int i = 0; i < arr.length; i++) {
            for (int j = 0; j < idf.length; j++) {
                arr[i][j] *= idf[j];
            }
        }
    }

    public static void printMatrix(double[][] arr, String[] words) {
        System.out.print("---Document---- ");
        for (String w : words)
            System.out.print(w + "  ");
        System.out.println();
        for (int i = 0; i < arr.length; i++) {
            System.out.print("---->file-" + (i + 1) + "  : ");
            for (int j = 0; j < arr[0].length; j++) {
                System.out.printf("%.3f\t", arr[i][j]);
            }
            System.out.println();
        }
    }
}

public class TfIdfDriver {
    public static void main(String[] args) {
        File f1 = new File("tf_Idf_Task/file1.txt");
        File f2 = new File("tf_Idf_Task/file2.txt");
        File f3 = new File("tf_Idf_Task/file3.txt");
        File f4 = new File("tf_Idf_Task/file4.txt");
        File[] files = new File[] { f1, f2, f3, f4 };

        DataCleaning d = new DataCleaning();
        String[][] cleanData = new String[files.length][];

        for (int i = 0; i < files.length; i++) {
            try (Scanner sc = new Scanner(files[i])) {
                // Collect every line first; cleaning inside the loop kept only the last line of each file.
                String str = "";
                while (sc.hasNextLine()) {
                    str += sc.nextLine() + " ";
                }
                str = DataCleaning.removePunctuation(str);
                str = DataCleaning.convertLowerCase(str);
                cleanData[i] = DataCleaning.removeStopWords(str);

            } catch (FileNotFoundException e) {
                System.out.println("File not found: " + e.getMessage());
            }

        }

        String[] unique = d.uniqueWords(cleanData);
        double[][] matrix = new double[cleanData.length][unique.length];
        d.calculateTF(matrix, cleanData, unique);
        double[] idf = d.calculateIDF(cleanData, unique);
        d.calculateTfIdf(matrix, idf);
        d.printMatrix(matrix, unique);

    }
}
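
A note on tokenising with split(" "): consecutive spaces produce empty tokens, and those empty strings then count as words in the TF denominator. Below is a minimal sketch of a cleaning step that avoids this, assuming only ASCII punctuation needs to be removed. TokenizerSketch is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not taken from the posts in this thread.
class TokenizerSketch {
    // Strips ASCII punctuation, lowercases, then splits on runs of whitespace.
    static String[] tokenize(String text) {
        String cleaned = text.replaceAll("\\p{Punct}", "").toLowerCase().trim();
        // "".split("\\s+") yields one empty token, so blank input is handled explicitly.
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        // Note the double space after the comma.
        System.out.println(Arrays.toString(tokenize("The car,  is driven!"))); // prints [the, car, is, driven]
    }
}
```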

shreiyarandive · 2022-03-06T18:15:16Z

package week4.WeekendTask;

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

public class TFIDF {

    public static String removePunctuations(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    public static String lowerCase(String s) {
        return s.toLowerCase();
    }

    public static String[] tokenise(String s) {
        return s.split(" ");
    }

    public static ArrayList<String> removeStopWords(String[] s) {
        List<String> stopWords = Arrays.asList(
                "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "a",
                "you're", "you've", "you'll", "you'd", "your", "yours", "yourself",
                "yourselves", "he", "most", "other",
                "some", "such", "no", "nor", "not", "only", "own", "same", "so", "then",
                "too", "very", "is", "t",
                "can", "will", "just", "don", "don't", "should", "should've", "now", "d",
                "ll", "m", "o", "re", "ve",
                "y", "ain", "aren't", "could", "couldn't", "didn't", "didn't", "the", "of");

        ArrayList<String> string = new ArrayList<String>(Arrays.asList(s));

        string.removeAll(stopWords);

        return string;

    }

    public static double calculateTF(int count, int size) {
        return count / (double) size;
    }

    public static double calculateIDF(int noOfDocument, int count) {
        return Math.log10((noOfDocument / (double) count));
    }

    public static void printMatrix(List<String> list) {

        System.out.println(list);
    }

    public static void main(String[] args) {

        String[] files = { "Week4/WeekendTask/file1.txt", "Week4/WeekendTask/file2.txt", "Week4/WeekendTask/file3.txt",
                "Week4/WeekendTask/file4.txt" };

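        // Keep each document's tokens separate: TF is per document, IDF is across documents.
        List<List<String>> docs = new ArrayList<>();

        for (int i = 0; i < files.length; i++) {
            try (Scanner s = new Scanner(new FileReader(files[i]))) {
                // Read every line of the file, not only the first one.
                StringBuilder d = new StringBuilder();
                while (s.hasNextLine()) {
                    d.append(s.nextLine()).append(" ");
                }

                String puncRemoved = removePunctuations(d.toString());

                String lowerCase = lowerCase(puncRemoved);

                String[] tokenString = tokenise(lowerCase);

                docs.add(removeStopWords(tokenString));

            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
        }

        Set<String> unique = new HashSet<String>();
        for (List<String> doc : docs) {
            unique.addAll(doc);
        }

        for (int i = 0; i < docs.size(); i++) {
            List<String> doc = docs.get(i);
            Map<String, Double> map = new HashMap<>();

            for (String key : unique) {
                // Document frequency: number of documents that contain the term.
                int docCount = 0;
                for (List<String> other : docs) {
                    if (other.contains(key)) {
                        docCount++;
                    }
                }
                double tf = calculateTF(Collections.frequency(doc, key), doc.size());
                double idf = calculateIDF(docs.size(), docCount);
                map.put(key, (tf * idf));
            }

            // Print the TF-IDF scores of this document.
            System.out.println("D" + (i + 1) + " " + map);
        }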

    }

}

Amrit-Raj22 · 2022-03-06T19:57:38Z

package Custom;

import java.io.*;
import java.util.*;

class FindIfIdf {

    String operations(String text) {
        return text.replaceAll("[!\"”#$%&'()*+,-./:;?@^_`{|}~]", "").toLowerCase();
    }

    String[] stopWordRemoval(String s) {
        String[] stopWords = new String[] { "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "a",
                "you're", "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", "most", "other",
                "some", "such", "no", "nor", "not", "only", "own", "same", "so", "then", "too", "very", "is", "t",
                "can", "will", "just", "don", "don't", "should", "should've", "now", "d", "ll", "m", "o", "re", "ve",
                "y", "ain", "aren't", "could", "couldn't", "didn't", "didn't", "the", "of" };
        String[] str2 = s.split(" ");
        int size = 0;
        for (int i = 0; i < str2.length; i++) {
            for (String j : stopWords) {
                if (str2[i].equals(j)) {
                    size++;
                    break; // count each token once, even if a stop word is listed twice
                }
            }
        }
        String[] res = new String[str2.length - size];
        int i = 0;
        for (String st : str2) {
            // System.out.print(st+"\t");
            boolean stat = true;
            for (String j : stopWords) {
                if (st.equals(j)) {
                    stat = false;
                    break;
                }
                // System.out.println("in loop2 \t");
            }
            if (stat)
                res[i++] = st;
        }
        return res;
    }

    String[] uniqueWords(String[][] list) {
        Set<String> set = new HashSet<>();   //////
        for (String[] i : list) {
            for (String str : i) {
                set.add(str);
            }
        }
        return set.toArray(new String[set.size()]);
    }

    double strres[][];

    void getTF(String[] result, String[][] list) {
        for (int i = 0; i < list.length; i++) {
            for (int j = 0; j < result.length; j++) {
                int count = 0;
                String s = result[j];
                for (int k = 0; k < list[i].length; k++) {
                    if (s.equals(list[i][k])) {
                        count += 1;
                    }
                }
                // System.out.println(count);
                // TF divides by the number of words in this document, not by the number of documents.
                strres[i][j] = count / (double) list[i].length;

            }

        }

    }

    void getIDF(String[] result, String[][] list) {
        getTF(result, list);
        double[] idf = new double[result.length];
        int p = 0;
        for (String i : result) {
            int c = 0;
            for (String[] string : list) {
                for (String j : string) {

                    if (j.equals(i)) {
                        c++;
                        break;
                    }
                }
            }
            idf[p++] = c;
        }
        for (int i = 0; i < idf.length; i++) {
            idf[i] = Math.log10(list.length / idf[i]);
        }
        for (int j = 0; j < strres.length; j++) {
            for (int i = 0; i < idf.length; i++) {
                strres[j][i] *= idf[i];
            }
        }

    }

    void printMatrix(String[] cols) {
        System.out.print("Docs:  ");
        for (String word : cols)
            System.out.print(word + "   ");
        System.out.println();
        for (int i = 0; i < strres.length; i++) {
            System.out.print("D" + (i + 1) + ":    ");
            for (int j = 0; j < strres[0].length; j++) {
                System.out.print(String.format("%.4f ", strres[i][j]));
            }
            System.out.println();
        }
    }
}

public class TfIdf {

    public static void main(String[] args) throws IOException {

        File[] files = new File[4];
        files[0] = new File("Custom/file1.txt");
        files[1] = new File("Custom/file2.txt");
        files[2] = new File("Custom/file3.txt");
        files[3] = new File("Custom/file4.txt");

        FindIfIdf obj = new FindIfIdf();
        String[][] str2 = new String[files.length][];
        for (int i = 0; i < files.length; i++) {
            try (Scanner sc = new Scanner(files[i])) {
                String str = "";
                while (sc.hasNextLine()) {
                    str += sc.nextLine() + " ";
                }
                str = obj.operations(str);
                str2[i] = obj.stopWordRemoval(str);

            } catch (FileNotFoundException e) {
                System.out.println(e.getMessage());
            }
        }
        String[] result = obj.uniqueWords(str2);
        obj.strres = new double[str2.length][result.length];
        obj.getTF(result, str2);
        obj.getIDF(result, str2);
        obj.printMatrix(result);

    }

}
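
For reference, term frequency is normalised by the length of the document being scored. Below is a minimal sketch, assuming the document has already been cleaned and tokenised and is not empty. TfSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not taken from the posts in this thread.
class TfSketch {
    // TF = occurrences of the term in this document / total terms in this document.
    static double termFrequency(List<String> doc, String term) {
        return (double) Collections.frequency(doc, term) / doc.size();
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("car", "driven", "road", "car");
        System.out.println(termFrequency(doc, "car"));   // prints 0.5 (2 of 4 tokens)
        System.out.println(termFrequency(doc, "truck")); // prints 0.0
    }
}
```

The cast to double matters: dividing two int values would truncate every fraction below 1 to 0.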

ujjwal18bit1167 · 2022-03-10T12:01:16Z

import java.io.File;
import java.io.FileNotFoundException;
import java.util.*;

public class TfIdf {
    public static String removePunctuation(String s) {
        return s.replace("[", "").replace("]", "").replaceAll("[!\"”#$%&'()*+,-./:;?@^_`{|}~]", "");
    }

    public static String[] removeStopWords(String s) {
        String[] stopWords = new String[] { "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "a",
                "you're", "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", "most", "other",
                "some", "such", "no", "nor", "not", "only", "own", "same", "so", "then", "too", "very", "is", "t",
                "can", "will", "just", "don", "don't", "should", "should've", "now", "d", "ll", "m", "o", "re", "ve",
                "y", "ain", "aren't", "could", "couldn't", "didn't", "didn't", "the", "of" };
        String[] tokens = s.split(" ");
        String[] ans;
        int count = 0;
        for (int i = 0; i < tokens.length; i++) {
            for (String str : stopWords) {
                if (tokens[i].equals(str)) {
                    count++;
                    break; // count each token once, even if a stop word is listed twice
                }
            }
        }
        ans = new String[(tokens.length - count)];
        int j = 0;
        for (int i = 0; i < tokens.length; i++) {
            boolean flag = true;
            for (String str : stopWords) {
                if (tokens[i].equals(str)) {
                    flag = false;
                    break;
                }
            }
            if (flag) {
                ans[j] = tokens[i];
                j++;
            }
        }
        return ans;
    }

    public static String[][] cleanData(File[] f) {

        String[][] data = new String[f.length][];
        for (int i = 0; i < f.length; i++) {
            try (Scanner fReader = new Scanner(f[i])) {
                String str = "";
                while (fReader.hasNextLine()) {
                    str += fReader.nextLine() + " ";
                }

                str = removePunctuation(str);
                str = str.toLowerCase();
                data[i] = removeStopWords(str);

            } catch (FileNotFoundException e) {
                System.out.println(e.getMessage());
            }
        }
        return data;
    }

    public static String[] getUniqueWords(String[][] f) {
        Set<String> set = new HashSet<>();
        for (String[] file : f) {
            for (String str : file) {
                set.add(str);
            }
        }
        return set.toArray(new String[set.size()]);
    }

    public static void calculateTF(double[][] matrix, String[][] cleanFiles, String[] uniques) {
        for (int i = 0; i < cleanFiles.length; i++) {
            for (int j = 0; j < uniques.length; j++) {
                int count = 0;
                String word = uniques[j];
                for (int k = 0; k < cleanFiles[i].length; k++) {
                    if (word.equals(cleanFiles[i][k])) {
                        count++;
                    }
                }
                matrix[i][j] = (double) count / cleanFiles[i].length;
            }
        }
    }

    public static double[] calculateIDF(String[][] f, String[] unique) {
        double[] freq = new double[unique.length];
        int j = 0;

        for (String word : unique) {
            int count = 0;
            for (String[] file : f) {
                for (String wordinFile : file) {
                    if (wordinFile.equals(word)) {
                        count++;
                        break;
                    }
                }
            }
            freq[j] = count;
            j++;
        }
        int noOfDocuments = f.length;
        for (int i = 0; i < freq.length; i++) {
            freq[i] = Math.log10(noOfDocuments / freq[i]);
        }
        return freq;
    }

    public static void calculateTFIDF(double[][] matrix, double[] idf) {
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < idf.length; j++) {
                matrix[i][j] *= idf[j];
            }
        }
    }

    public static void printMatrix(double[][] matrix, String[] uniqueWords) {
        System.out.print("Document| ");
        for (String word : uniqueWords)
            System.out.print(word + "\t\t\t");
        System.out.println();
        for (int i = 0; i < matrix.length; i++) {
            System.out.print("file" + (i + 1) + "   | ");
            for (int j = 0; j < matrix[0].length; j++) {
                System.out.printf("%.5f\t\t\t", matrix[i][j]);
            }
            System.out.println();
        }
    }

    public static void main(String[] args) {
        File f1 = new File("file1.txt");
        File f2 = new File("file2.txt");
        File f3 = new File("file3.txt");
        File f4 = new File("file4.txt");
        File[] file = new File[] { f1, f2, f3, f4 };

        String[][] cleanFiles = cleanData(file);
        String[] unique = getUniqueWords(cleanFiles);

        double[][] matrix = new double[cleanFiles.length][unique.length];

        calculateTF(matrix, cleanFiles, unique);
        double[] idf = calculateIDF(cleanFiles, unique);

        calculateTFIDF(matrix, idf);

        printMatrix(matrix, unique);

    }
}

Krishreddy460 · 2022-03-13T09:51:41Z

import java.util.*;
import java.io.*;

class StringsOperations
{
public static String[] removePunctuation(String[] arraOfStrings)
{
    // ASCII punctuation, including '.', ',' and quotes, which the task asks to remove.
    String p="!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";
    String[] result =new String[arraOfStrings.length];
    for(int i=0;i<arraOfStrings.length;i++)
    {
        String temp=arraOfStrings[i];
        String tempp="";

        for(int j=0;j<temp.length();j++)
        {
            // Keep only the characters that are not punctuation.
            if(!p.contains(temp.charAt(j)+""))
            {
                tempp+=temp.charAt(j);
            }
        }
        result[i]=tempp;
    }
    return result;

}

public static String[] Lowercase(String[] str) 
{
    for (int i = 0; i < str.length; i++) {
        str[i] = str[i].toLowerCase();
    }
    return str;
}

public static List<List<String>> tokenization(String[] data) {
    List<List<String>> list = new ArrayList<>();
    for (String str : data) {
        String[] arr = str.split(" ");
        list.add(new ArrayList<>(Arrays.asList(arr)));
    }
    return list;
}

public static String[] getUniqueWords(List<List<String>> list) {
    Set<String> set = new HashSet<>();
    for (List<String> l : list) {
        for (String str : l) {
            set.add(str);
        }
    }
    return set.toArray(new String[set.size()]);
}

public static List<List<String>> stopWordsRemoval(List<List<String>> list) 
{   
    String stopWord = "i, a, is, me, by, my, of, myself, we, our, ours, ourselves, you, you’re, you’ve, you’ll, you’d, your, yours, yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don’t, should, should’ve, now, d, ll, m, o, re, ve, y, ain, aren’t, could, couldn’t, didn’t, didn’t";
    List<List<String>> l=new ArrayList<List<String>>();
    List<List<String>> r=new ArrayList<List<String>>();
    for(int i = 0; i < list.size(); i++)
    {
        List<String> li=list.get(i);
        List<String> newAdd=new ArrayList<String>();
        for(int j = 0; j < li.size(); j++)
        {
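            // Compare whole words: String.contains would also drop any word that is a
            // substring of the stop word text (for example "self" inside "myself").
            if(!li.get(j).isEmpty() && !Arrays.asList(stopWord.split(", ")).contains(li.get(j)))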
            {
                newAdd.add(li.get(j));
            }
        }
        r.add(newAdd);

    }
    return r;
}
public static int uniquecountWords(String word, List<String> words) 
{
    int ans = 0;
    for (String str : words) {
        if (str.equals(word)) {
            ans++;
        }
    }
    return ans;
}

public static double[] generateTFAndIdf(double[][] matrix, String[] uniqueWords, List<List<String>> list) {
    double[] idf = new double[uniqueWords.length];
    for (int j = 0; j < matrix[0].length; j++) {
        String word = uniqueWords[j];
        double noOfTimesTOccured = 0;
        for (int i = 0; i < matrix.length; i++) {
            List<String> words = list.get(i);
            // Cast to double first; int / int would truncate every TF below 1 to 0.
            matrix[i][j] = (double) uniquecountWords(word, words) / words.size();
            if (matrix[i][j] > 0) {
                noOfTimesTOccured++;
            }
        }
        idf[j] = Math.log10(matrix.length / noOfTimesTOccured);

    }
    return idf;
}

public static void printMatrix(double[][] matrix, String[] uniqueWords) {
    System.out.print("Document| ");
    for (String word : uniqueWords)
        System.out.print(word + " ");
    System.out.println();
    for (int i = 0; i < matrix.length; i++) {
        System.out.print("file" + (i + 1) + "   | ");
        for (int j = 0; j < matrix[0].length; j++) {
            System.out.print(String.format("%.3f ", matrix[i][j]));
        }
        System.out.println();
    }
}

public static void calculateTFIDF(double[][] matrix, double[] idf) {
    for (int j = 0; j < matrix[0].length; j++) {
        for (int i = 0; i < matrix.length; i++) {
            matrix[i][j] *= idf[j];
        }
    }
}

}
public class StringsDemo
{
public static void main(String[] args) throws FileNotFoundException
{

    String[] files = { "file1.txt", "file2.txt", "file3.txt", "file4.txt" };
    String[] data = new String[files.length];

    // Read every file completely before starting the pipeline; running the
    // pipeline inside this loop would hit null entries for files not yet read.
    for (int i = 0; i < files.length; i++) {
        StringBuilder content = new StringBuilder();
        try (Scanner s = new Scanner(new File(files[i]))) {
            while (s.hasNextLine()) {
                // Append each line instead of overwriting, so multi-line files are kept whole.
                content.append(s.nextLine()).append(" ");
            }
        }
        data[i] = content.toString().trim();
    }

    String[] afterRemovalofpunctations = StringsOperations.removePunctuation(data);
    System.out.println(Arrays.toString(afterRemovalofpunctations));

    String[] afterLowering = StringsOperations.Lowercase(afterRemovalofpunctations);
    System.out.println(Arrays.toString(afterLowering));

    List<List<String>> afterTokenizer = StringsOperations.tokenization(afterLowering);
    System.out.println(afterTokenizer);

    List<List<String>> afterStopWords = StringsOperations.stopWordsRemoval(afterTokenizer);
    System.out.println(afterStopWords);

    String[] getUniqueWords = StringsOperations.getUniqueWords(afterStopWords);
    // Arrays.toString prints the words; printing the array directly shows only its reference.
    System.out.println(Arrays.toString(getUniqueWords));

    double[][] matrix = new double[files.length][getUniqueWords.length];
    double[] idf = StringsOperations.generateTFAndIdf(matrix, getUniqueWords, afterStopWords);
    StringsOperations.calculateTFIDF(matrix, idf);
    StringsOperations.printMatrix(matrix, getUniqueWords);
}

}

0 replies

KaminiJha · 2022-03-13T14:04:04Z

KaminiJha
Mar 13, 2022

package WeekendTasks;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;

public class TfIdf {
    public static String puncRem(String content) {
        return content.replaceAll("\\p{Punct}", "");
    }

    public static String[] removeStopWord(String content) {
        String[] stopwords = { "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "you’re", "you’ve",
                "you’ll", "you’d", "your", "yours", "yourself", "yourselves", "he", "most", "other", "some", "such",
                "no", "nor", "not", "only", "own", "same", "so", "then", "too", "very", "s", "t", "can", "will", "just",
                "don", "don’t", "should", "should’ve", "now", "d", "ll", "m", "o", "re", "ve", "y", "ain", "aren’t",
                "could", "couldn’t", "didn’t", "didn’t" };
        String[] modifiedContent = content.split(" ");
        List<String> arr = new ArrayList<>(Arrays.asList(modifiedContent));
        // Remove every stop word in one pass. Calling arr.remove(i) inside an indexed
        // loop skips elements and can run past the end of the list.
        arr.removeAll(Arrays.asList(stopwords));
        return arr.toArray(new String[arr.size()]);
    }

    public static String[] getUniqueWords(String[][] content) {
        HashSet<String> set = new HashSet<>();
        for (String[] file : content) {
            for (String s : file) {
                set.add(s);
            }
        }
        return set.toArray(new String[set.size()]);
    }

    public static void tf(double[][] a, String[] str, String[][] content) {
        for (int i = 0; i < content.length; i++) {
            for (int j = 0; j < str.length; j++) {
                int count = 0;
                String word = str[j];
                for (int k = 0; k < content[i].length; k++) {
                    if (word.equals(content[i][k])) {
                        count++;
                    }
                }
                a[i][j] = (double) count / content[i].length;
            }
        }
    }

    public static double[] idf(String[][] content, String[] s) {
        double[] ans = new double[s.length];
        int j = 0;
        for (String word : s) {
            int count = 0;
            for (String[] file : content) {
                for (String i : file) {
                    if (i.equals(word)) {
                        count++;
                        break;
                    }
                }
            }
            ans[j] = count;
            j++;
        }
        int noOfDocuments = content.length;
        for (int i = 0; i < ans.length; i++) {
            ans[i] = Math.log10(noOfDocuments / ans[i]);
        }
        return ans;
    }

    public static void tfidf(double[][] matrix, double[] idf) {
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < idf.length; j++) {
                matrix[i][j] *= idf[j];
            }
        }
    }

    public static void displayMat(double[][] matrix, String[] uniqueWords) {
        for (String word : uniqueWords)
            System.out.print(word + "  ");
        System.out.println();
        for (int i = 0; i < matrix.length; i++) {
            System.out.print("file" + (i + 1) + "  : ");
            for (int j = 0; j < matrix[0].length; j++) {
                System.out.printf("%.2f ", matrix[i][j]);
            }
            System.out.println();
        }
    }

    public static void main(String[] args) {
        String[] files = { "WeekendTasks/file1.txt", "WeekendTasks/file2.txt", "WeekendTasks/file3.txt",
                "WeekendTasks/file4.txt" };
        String[][] data = new String[files.length][];
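        for (int i = 0; i < files.length; i++) {
            StringBuilder text = new StringBuilder();
            try (Scanner sc = new Scanner(new File(files[i]))) {
                // Collect every line; assigning data[i] inside the loop kept only the last one.
                while (sc.hasNextLine()) {
                    text.append(sc.nextLine()).append(" ");
                }
            } catch (FileNotFoundException e) {
                System.out.println(e);
            }
            data[i] = removeStopWord(puncRem(text.toString().trim()).toLowerCase());
        }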
        String[] unique = getUniqueWords(data);
        double[][] mat = new double[data.length][unique.length];
        tf(mat, unique, data);
        double[] idf = idf(data, unique);
        tfidf(mat, idf);
        System.out.println("************MATRIX*************");
        displayMat(mat, unique);

    }

}

0 replies

rakeshGit419 · 2022-04-08T10:30:21Z

rakeshGit419
Apr 8, 2022

import java.io.*;
import java.util.*;
import java.lang.reflect.Array;
import java.text.DecimalFormat;


public class TFandIDFAnalysis {
    public static void writeInFile(String s1, String s2, String s3, String s4,String file1, String file2, String file3, String file4) {
        try (FileWriter fw1 = new FileWriter(file1);FileWriter fw2 = new FileWriter(file2);
            FileWriter fw3 = new FileWriter(file3);FileWriter fw4 = new FileWriter(file4);){
            fw1.write(s1);fw2.write(s2);fw3.write(s3);fw4.write(s4);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static String[] readFromFile(String file1, String file2, String file3, String file4) {
        try(FileReader fr1 = new FileReader(file1);FileReader fr2 = new FileReader(file2);
                FileReader fr3 = new FileReader(file3);FileReader fr4 = new FileReader(file4);) {
            String[] res = new String[4];
            int i = 0;
            for(String s : res) 
                res[i++] = "";
            int ch;
            while ((ch = fr1.read()) != -1) res[0] += (char) ch;
            while ((ch = fr2.read()) != -1) res[1] += (char) ch;
            while ((ch = fr3.read()) != -1) res[2] += (char) ch;
            while ((ch = fr4.read()) != -1) res[3] += (char) ch;
            return res;
            }
            catch (IOException e) {
                e.printStackTrace();
                return new String[0];
            }
    }

    public static List<String> filterString(String s) {
        String[] strArr = s.replaceAll("\\p{Punct}", "").toLowerCase().split("\\s");
        strArr = removeString(strArr);
        return Arrays.asList(strArr);
    }

    public static String[] removeString(String[] s) {
        String stopWords = "i, me, my, myself, we, you, you’re, you’ve, you’ll, you’d, your, yours, yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don’t, should, should’ve, now, d, ll, m, o, re, ve, y, ain, aren’t, could, couldn’t, didn’t, didn’t, the, a, by, is, of";
        String[] arr = stopWords.split(", ");
        for (String str1 : arr) {
            for (String str2 : s)
                if (str2.equals(str1)) {
                    s = remove(s, str1);
                }
        }
        return s;
    }

    public static String[] remove(String[] array, String ele) {
    if (array.length > 0) {
        int idx = -1;
        for (int i = 0; i < array.length; i++) {
            if (array[i].equals(ele)) {
                idx = i;
                break;
            }
        }
        if (idx >= 0) {
            String[] copy = (String[]) Array.newInstance(array.getClass()
                    .getComponentType(), array.length - 1);
            if (copy.length > 0) {
                System.arraycopy(array, 0, copy, 0, idx);
                System.arraycopy(array, idx + 1, copy, idx, copy.length - idx);
            }
            return copy;
        }
    }
    return array;
}
    public static Set<String> getCommWords(List<List<String>> input) {
        Set<String> res = new HashSet<>();
        for(List<String> s : input) {
             for (String x : s)
                    res.add(x);
        }
        return res;
    }

    public static double[][] freqArray(Set<String> setOfCommonWords, List<List<String>> listofString) {
        double[][] tf = new double[4][setOfCommonWords.size()];
        List<String> commonWords = new ArrayList<>();
        commonWords.addAll(setOfCommonWords);
        String[][] strings = new String[4][setOfCommonWords.size()];
        int i = 0, j = 0;
        for (List<String> line : listofString)
            strings[i++] = line.toArray(new String[0]);
        for (i = 0; i < 4; i++) 
            for (j = 0; j < setOfCommonWords.size(); j++) 
                // Count every occurrence of the word, not just whether it is present.
                tf[i][j] = Collections.frequency(Arrays.asList(strings[i]), commonWords.get(j));
        return tf;
    }

    public static double[][] normalizedTFtable(double[][] tf, int row, int col) {
        double[] sumRow = new double[row];
        for (int i = 0; i < row; i++) {
            for (int j = 0; j < col; j++) {
                sumRow[i] = sumRow[i] + tf[i][j];
            }
        }
        for (int i = 0; i < row; i++) {
            for (int j = 0; j < col; j++) {
                tf[i][j] /= sumRow[i];
            }
        }
        return tf;
    }

    public static double[] getIDFandDisplayIDF(double[][]  tfAndIdf, int row, int col,Set<String> setOfCommonWords){
        double[] IDF = new double[col];
        for (int i = 0; i < col; i++) {
            for (int j = 0; j < row; j++) 
                // Document frequency: count the documents that contain the word at least once.
                if (tfAndIdf[j][i] > 0)
                    IDF[i]++;
        }
        DecimalFormat df = new DecimalFormat("#0.0000");
            for (String words : setOfCommonWords) 
                System.out.print(words + "\t");
            for (int i = 0; i < col ; i++) {
                IDF[i] = Math.log10(row / IDF[i]);
                System.out.print(df.format(IDF[i]) + "\t");
            }
            System.out.println("\n");
        return IDF;
    }

    public static void displayTable(double[][] tf, int n, int m, Set<String> setOfCommonWords) {
        DecimalFormat df = new DecimalFormat("#0.00000");
        System.out.print("D/W\t");
        for (String words : setOfCommonWords) {
            if(words.length() > 7)
                System.out.print(words + "\t");
            else    
                System.out.print(words + "\t\t");
        }
        System.out.println();
        for (int i = 0; i < n; i++) {
            System.out.print("D " + (i + 1) + ":\t");
            for (int j = 0; j < m; j++) 
                System.out.print(df.format(tf[i][j]) + "\t\t");
            System.out.println();
        }
    }

    public static double[][] getTFandIDF(double[][] tfAndIdf, double[] idf, int row, int col) {
        for (int j = 0; j < col; j++) {
            for (int i = 0; i < row; i++)
                tfAndIdf[i][j] *= idf[j];
        }
        return tfAndIdf;
    }
    public static void main(String[] args) {
        String s1 = "My name is Groot.",s2 = "My ship is a Starship.",s3 = "The ship I drive, is, a Starship Benatar.",s4 = "My ship is a Benatar model of Starship !!";
        String file1 = "week4/file1.txt",file2 = "week4/file2.txt",file3 = "week4/file3.txt",file4 = "week4/file4.txt";
        writeInFile(s1,s2,s3,s4,file1,file2,file3,file4);

        String[] stotal = readFromFile(file1,file2,file3,file4);
        List<List<String>> filteredString = new ArrayList<>();
        for(String s : stotal)
            filteredString.add(filterString(s));
        /**
         * Finding common words:
         */
        Set<String> setOfCommonWords = new HashSet<>();
        setOfCommonWords = getCommWords(filteredString);
        /**
         * Frequency of occurrences:
         */
        double[][] tf = new double[4][setOfCommonWords.size()];
        tf = freqArray(setOfCommonWords, filteredString);
        displayTable(tf, 4, setOfCommonWords.size(), setOfCommonWords);

        double[] idf = new double[setOfCommonWords.size()];
        idf = getIDFandDisplayIDF(tf, 4, setOfCommonWords.size(),setOfCommonWords);
        normalizedTFtable(tf, 4, setOfCommonWords.size());
        System.out.println("\n --- Normalized TF table ---- \n");
        displayTable(tf, 4, setOfCommonWords.size(), setOfCommonWords);
        /**
         * FINAL TF AND IDF table:
         */
        getTFandIDF(tf, idf, 4, setOfCommonWords.size());
        System.out.println("\n------ FINAL TF and IDF Table: ------\n");
        displayTable(tf, 4, setOfCommonWords.size(), setOfCommonWords);
    }
}

0 replies

VigneshVishwa · 2022-04-13T03:19:01Z

VigneshVishwa
Apr 13, 2022

import java.util.*;
import java.io.File;
import java.io.IOException;

public class Tfidf {
    public static String removePunc(String str) {
        str = str.replaceAll("[^a-zA-Z0-9\\s]", "");
        return str;

    }

    public static String loweringCase(String str) {
        str = str.toLowerCase();
        return str;
    }

    public static String[] tokenization(String str) {
        return str.split(" ");
    }

    public static StringBuilder stopWord(String[] newStr) {

        String stopWord = "i, a, is, me, by, my, of, myself, we, our, ours, ourselves, you, you’re, you’ve, you’ll, you’d, your, yours, yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don’t, should, should’ve, now, d, ll, m, o, re, ve, y, ain, aren’t, could, couldn’t, didn’t, didn’t";
        StringBuilder sb = new StringBuilder();
        for (String word : newStr) {
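            // Compare whole words: String.contains would also drop any word that is a
            // substring of the stop word text (for example "self" inside "myself").
            if (!word.isEmpty() && !Arrays.asList(stopWord.split(", ")).contains(word)) {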
                sb.append(word + " ");
            }
        }
        return sb;

    }

    public static void uniqueWords(String[] s) {
        Set<String> uniqueWords = new HashSet<>();
        for (String str : s)
            uniqueWords.add(str);
        matCreation(uniqueWords);

    }

    public static void matCreation(Set<String> uniqueWords) {
        // Row 0 is the header; rows 1 to 4 stand for the four files.
        String[][] mat = new String[5][uniqueWords.size() + 1];
        mat[0][0] = "Document";
        List<String> list = new ArrayList<>(uniqueWords);
        for (int i = 0; i < 5; i++) {
            for (int j = 0; j <= uniqueWords.size(); j++) {
                if (i == 0 && j == 0) {
                    continue; // already holds "Document"
                } else if (j == 0) {
                    mat[i][j] = "file" + i;
                } else if (i == 0) {
                    mat[i][j] = list.get(j - 1);
                } else {
                    mat[i][j] = "0"; // placeholder until the TF-IDF values are computed
                }
            }
        }
        for (int i = 0; i < 5; i++) {
            for (int j = 0; j <= uniqueWords.size(); j++) {
                System.out.print(mat[i][j] + " ");
            }
            System.out.println();
        }
    }

    // public static void tf()

    public static void idf(String[] allWords, Set<String> unique) {
        for (String s : unique) {
            int count = 0;
            for (String word : allWords) {
                if (s.equals(word)) {
                    count++;
                }
            }
            double idf = 0;
            idf = Math.round((Math.log10(4.0 / count) * 1000.00)) / (double) 1000;
            System.out.println(s + " " + idf);

        }

    }

    public static void main(String[] args) throws IOException {
        String[] files = { "src/File1.txt", "src/File2.txt", "src/File3.txt", "src/File4.txt" };
        StringBuilder newSb = new StringBuilder();
        for (String file : files) {
            File obj = new File(file);
            Scanner reader = new Scanner(obj);
            while (reader.hasNextLine()) {
                String line = reader.nextLine();

                newSb.append(stopWord(tokenization(loweringCase(removePunc(line)))));
            }
            reader.close();
        }
        String[] words = newSb.toString().split(" ");
        uniqueWords(words);

    }

}
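
For comparison with the `idf` method above, which counts every occurrence of a word in the combined text of all files: the IDF described in the task is based on the number of documents that contain the word. A minimal sketch of that variant follows, assuming each file has already been cleaned and tokenized into its own list. The class `IdfSketch`, its method names, and the sample word lists are hypothetical and only serve the illustration; `List.of` needs Java 9 or later.

```java
import java.util.List;

// Hypothetical helper class, not taken from any reply in this thread.
class IdfSketch {

    // Document frequency: the number of documents containing the word at least once.
    static int documentFrequency(String word, List<List<String>> docs) {
        int df = 0;
        for (List<String> doc : docs) {
            if (doc.contains(word)) {
                df++;
            }
        }
        return df;
    }

    // IDF = log10(N / df), where N is the number of documents.
    static double idf(String word, List<List<String>> docs) {
        int df = documentFrequency(word, docs);
        if (df == 0) {
            return 0.0; // assumption: a word found in no document gets 0 instead of infinity
        }
        return Math.log10((double) docs.size() / df);
    }

    public static void main(String[] args) {
        // Assumed sample input: four already-cleaned documents.
        List<List<String>> docs = List.of(
                List.of("name", "groot"),
                List.of("ship", "starship"),
                List.of("ship", "drive", "starship", "benatar"),
                List.of("ship", "benatar", "model", "starship"));
        for (String w : new String[] { "ship", "groot" }) {
            System.out.println(w + " " + idf(w, docs));
        }
    }
}
```

Keeping one list per file is the design choice that matters here: once all files are merged into a single `StringBuilder`, the per-document counts needed for both TF and IDF can no longer be recovered.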

0 replies


Replies: 21 comments
